OLTP – DBMS 2 : DataBase Management System Services

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line, SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations.
SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.
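The logical-to-physical partition scheme described above can be illustrated with a minimal Python sketch. It is not SequoiaDB's code: the hash function, the function names and the rebalancing policy are all assumptions made for illustration. The point is only that a record's logical partition never changes, while the small table mapping logical partitions to physical ones is what gets edited when nodes are added or removed.

```python
import hashlib

NUM_LOGICAL = 4096  # the typical logical partition count cited above


def logical_partition(key: str) -> int:
    """Hash a record key to a fixed logical partition (hash choice is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LOGICAL


def rebalance(assignment: dict[int, int], num_physical: int) -> dict[int, int]:
    """Spread logical partitions evenly over nodes 0..num_physical-1.

    Partitions stay where they are unless their node is over quota or gone.
    """
    base, extra = divmod(len(assignment), num_physical)
    quota = {n: base + (1 if n < extra else 0) for n in range(num_physical)}
    owned: dict[int, list[int]] = {n: [] for n in range(num_physical)}
    pool: list[int] = []  # logical partitions that must find a new home
    for lp in sorted(assignment):
        node = assignment[lp]
        if node in owned and len(owned[node]) < quota[node]:
            owned[node].append(lp)
        else:
            pool.append(lp)
    for node in range(num_physical):
        while len(owned[node]) < quota[node]:
            owned[node].append(pool.pop())
    return {lp: node for node, lps in owned.items() for lp in lps}


def moved(old: dict[int, int], new: dict[int, int]) -> list[int]:
    """Logical partitions whose physical home differs between two assignments."""
    return [lp for lp in old if old[lp] != new[lp]]


if __name__ == "__main__":
    four_nodes = {lp: lp % 4 for lp in range(NUM_LOGICAL)}
    five_nodes = rebalance(four_nodes, 5)
    print(len(moved(four_nodes, five_nodes)), "logical partitions relocated")
```

In this sketch, growing from 4 to 5 physical partitions relocates 819 of the 4096 logical partitions, all of them onto the new node; everything else stays put.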

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change, well, a view can be established to allow for that.
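A rough sketch of the view-over-changing-schemas idea follows. It uses Python's built-in sqlite3 module purely as a stand-in for SequoiaDB's PostgreSQL-based front end, and the table, column and view names are hypothetical. Two tables hold the data from before and after a schema change, and a view exposes only the columns that stayed the same.

```python
import sqlite3

# In-memory SQLite database as a stand-in; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Table created for the original schema.
    CREATE TABLE accounts_v1 (id INTEGER, name TEXT, balance REAL);
    -- New table created after the schema gained a column.
    CREATE TABLE accounts_v2 (id INTEGER, name TEXT, balance REAL, branch TEXT);

    INSERT INTO accounts_v1 VALUES (1, 'Ann', 10.0);
    INSERT INTO accounts_v2 VALUES (2, 'Bo', 25.5, 'East');

    -- The view covers only the columns common to both schema versions.
    CREATE VIEW accounts_stable AS
        SELECT id, name, balance FROM accounts_v1
        UNION ALL
        SELECT id, name, balance FROM accounts_v2;
    """
)

rows = conn.execute(
    "SELECT id, name, balance FROM accounts_stable ORDER BY id"
).fetchall()
print(rows)
```

Queries against the view need not know which schema version a given row arrived under.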

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short-term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Data models

Curt Monash — Mon, 23 Feb 2015 03:08:10 +0000

7-10 years ago, I repeatedly argued the viewpoints:

Relational DBMS were the right choice in most cases.
Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

Hadoop has flourished.
NoSQL has flourished.
Graph DBMS have matured somewhat.
Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

To make the subject somewhat manageable, I’ll focus on fielded data (i.e. data that represents values of something) rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of name-value pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.

Important distinctions include:

Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be space-savingly implicit as well. In other logs, XML streams, JSON streams and so on they are explicit.
If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data.
In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
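The first and third distinctions above can be made concrete with a small Python sketch. The schema, field names and sample records are invented for illustration. A delimited row carries only values, so its names must come from metadata, while a JSON document carries its own names and can even repeat one.

```python
import csv
import io
import json

# Hypothetical metadata: the field names a delimited record leaves implicit.
SCHEMA = ["user_id", "action", "ms"]


def from_implicit(line: str) -> dict[str, str]:
    """Recover name-value pairs for a row whose names live only in metadata."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(SCHEMA, values))


def from_explicit(doc: str) -> dict:
    """A JSON document states its own field names."""
    return json.loads(doc)


def has_unique_field_names(doc: str) -> bool:
    """Check the top-level names of a JSON object for duplicates.

    JSON text may repeat a name; a relational row cannot.
    """
    pairs = json.loads(doc, object_pairs_hook=list)
    names = [name for name, _ in pairs]
    return len(names) == len(set(names))


if __name__ == "__main__":
    print(from_implicit("42,login,137"))
    print(from_explicit('{"user_id": 42, "action": "login"}'))
    print(has_unique_field_names('{"tag": "a", "tag": "b"}'))
```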

Some major data models can be put into a fairly strict ordering of query desirability by noting:

The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology with in many cases great query performance.
The next-best thing to query is another kind of data store with known field names. In such data stores:
- SQL or SQL-like SELECTs will still work, or can easily be made to do so.
- Useful indexing systems can be grafted on to them (although they are typically less mature than in RDBMS).
- In the (mainly) future, perhaps JOINs can be grafted on as well.
The worst thing to query is a data store in which you only have a schema on read. You have to do work to make the thing queryable in the first place.
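As a hedged illustration of the “work to make the thing queryable” that schema on read implies, here is a minimal Python sketch. The log format (an Apache-style access line), the regular expression and the helper names are assumptions chosen for the example. Field names have to be imposed at read time before anything resembling a SELECT can run.

```python
import re
from typing import Iterable, Optional

# Assumed format: Apache-style access log lines. The pattern is the "schema".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line: str) -> Optional[dict[str, str]]:
    """Impose field names on one raw line; None if the line does not fit."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def select(lines: Iterable[str], **where: str) -> list[dict[str, str]]:
    """A toy SELECT ... WHERE: parse every line, then filter on equality."""
    records = (parse_line(line) for line in lines)
    return [
        rec for rec in records
        if rec is not None and all(rec.get(k) == v for k, v in where.items())
    ]


if __name__ == "__main__":
    raw = [
        '203.0.113.9 - - [01/Dec/2014:01:48:43 +0000] "GET /index.html HTTP/1.1" 200 5120',
        '203.0.113.7 - - [01/Dec/2014:01:48:44 +0000] "GET /missing HTTP/1.1" 404 -',
        "a line that does not match the assumed format",
    ]
    print(select(raw, status="404"))
```

Every query pays the parsing cost again unless the parsed result is stored, which is the tradeoff against the cheap write.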

Unsurprisingly, that ordering is reversed when it comes to writing data.

The easiest thing to write to is a data store with no structure.
Next-easiest is to write to a data store that lets you make up the structure as you go along.
The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed, notably:
- Implicit field names, governed by metadata.
- Unique field names within any one record.
- The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.

And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:

Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
- I have in the past also advocated for a mid-range — i.e. lighter-weight — general purpose RDBMS.
- SAP really, really wants you to use HANA to run SAP’s apps.
- You might want an in-memory RDBMS (MemSQL) or a particularly cloudy one or whatever.
Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
- They’re all immature right now, with advantages over each other.
- The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.

Beyond that:

You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system.
Splunk is cool.
Use cases for various other kinds of data stores can often be found.
Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.

And finally, I think in-memory data grids:

Will be widely used and important.
Will be used to instantiate multiple data models at once.

Related links

One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
I’ve long mused about the challenges of getting by without joins. (November, 2010)
In 2013 I observed that data models will be in perpetual, rapid flux.
In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
I surveyed data models and access methods back in 2008.

Thoughts and notes, Thanksgiving weekend 2014

Curt Monash — Mon, 01 Dec 2014 01:48:43 +0000

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
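For readers unfamiliar with bandit testing, here is a minimal sketch of one common variant, epsilon-greedy, in Python. It is not drawn from WibiData or any other vendor mentioned here; the class name, parameters and simulated conversion rates are all assumptions for illustration. Most traffic goes to the offer that has performed best so far, while a small fraction keeps exploring the alternatives.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit: explore with probability epsilon, else exploit."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.reward = {arm: 0.0 for arm in self.arms}

    def mean(self, arm):
        """Observed average reward for an arm (0.0 before any trial)."""
        return self.reward[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        """Pick a random arm with probability epsilon, else the best so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=self.mean)

    def record(self, arm, reward):
        """Fold one observed outcome back into the running totals."""
        self.pulls[arm] += 1
        self.reward[arm] += reward


def simulate(rates, rounds, epsilon=0.1, seed=0):
    """Run the bandit against made-up conversion rates, one visitor per round."""
    bandit = EpsilonGreedy(rates, epsilon=epsilon, seed=seed)
    world = random.Random(seed + 1)  # separate source of simulated outcomes
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.record(arm, 1.0 if world.random() < rates[arm] else 0.0)
    return bandit


if __name__ == "__main__":
    result = simulate({"offer_a": 0.05, "offer_b": 0.10}, rounds=5000)
    print(result.pulls)
```

Each recorded outcome immediately changes what the next visitor is shown, which is the sense in which experimentation produces new information that older models do not reflect.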

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:

Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.

4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.

WiredTiger has the same techie principals as SleepyKat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.

I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.

5. Hadoop’s traditional data distribution story goes something like:

Data lives on every non-special Hadoop node that does processing.
This gives the advantage of parallel data scans.
Sometimes data locality works well; sometimes it doesn’t.
Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
… but Hadoop is getting away from that kind of strict, I/O-intensive processing model.

However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership to work with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.

6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.

7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.

8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:

IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
… upstarts have three behemoths to outdo, not just one.
MySQL, PostgreSQL and to some extent Sybase are still around as well.

Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:

General-purpose RDBMS.
Analytic RDBMS.
NoSQL.
Non-relational analytic data stores (perhaps Hadoop-based).

it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.

All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.

9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like it has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

Using multiple data stores

Curt Monash — Wed, 18 Jun 2014 16:03:10 +0000

I’m commonly asked to assess vendor claims of the kind:

“Our system lets you do multiple kinds of processing against one database.”
“Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”

So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:

The most extreme vendor marketing claims are false.
There are many different choices that make sense in at least some use cases each.

Horses for courses

It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:

Short-request vs. analytic.
SQL vs. non-SQL (NoSQL or otherwise).
Expensive/heavy-duty vs. cheap/easy-to-support.

Vendors are part of this consensus; already in 2005 I observed:

For all practical purposes, there are no DBMS vendors left advocating single-server strategies.

Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.

Multiple data stores for a single application

We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree:

It’s common and sensible to manage authentication and authorization data in its own data store. Commonly, the data format is LDAP (Lightweight Directory Access Protocol).
It’s common and sensible to manage the “content” and “e-commerce transaction records” aspects of websites separately.
Even beyond that case, there are often performance reasons to manage BLOBs (Binary Large OBjects) outside your relational database.
Internet “interaction” data is also often best managed outside an RDBMS, in part because of its very non-tabular data structures.

The spectacular 2010 JP Morgan Chase outage was largely caused, I believe, by disregard of these precepts.
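
To make the BLOB precept above concrete, here is a minimal Python sketch in which a relational table holds only metadata and a file path, while the large object itself lives on the file system. The table name `documents`, the helpers `open_catalog`, `save_document` and `load_document`, and the pairing of SQLite with a local directory are all assumptions chosen for illustration; a production system would more plausibly pair a server RDBMS with a dedicated object or content store.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_catalog(db_path=":memory:"):
    """Create the (hypothetical) relational catalog that tracks each BLOB."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " sha256 TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL,"
        " path TEXT NOT NULL)"
    )
    return conn


def save_document(conn, blob_dir, name, payload):
    """Write the bytes to the file system; record only metadata in SQL."""
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(blob_dir) / digest  # content-addressed file name
    path.write_bytes(payload)
    cur = conn.execute(
        "INSERT INTO documents (name, sha256, size_bytes, path)"
        " VALUES (?, ?, ?, ?)",
        (name, digest, len(payload), str(path)),
    )
    conn.commit()
    return cur.lastrowid


def load_document(conn, doc_id):
    """Look up the path relationally, then read the bytes from the blob store."""
    row = conn.execute(
        "SELECT path, sha256 FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    path, expected = row
    payload = Path(path).read_bytes()
    # The two stores are not updated atomically, so verify they still agree.
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError("blob does not match the checksum in the catalog")
    return payload


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as blob_dir:
        conn = open_catalog()
        doc_id = save_document(conn, blob_dir, "scan.bin", b"\x00\x01" * 1024)
        assert load_document(conn, doc_id) == b"\x00\x01" * 1024
```

The checksum comparison is there because the cost of splitting stores is that nothing keeps them consistent for you; the application has to notice when they drift apart.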

There also are cases in which applications dutifully get all their data via SQL queries, but send those queries to two or more DBMS. Teradata is proud that its systems can support rather transactional queries (for example in call-center use cases), but the same application may read from and write to a true OLTP database as well.

Further, many OLTP (OnLine Transaction Processing) applications do some fraction of their work via inbound or outbound messaging. Many buzzwords can come into play here, including but not limited to:

SOA (Service-Oriented Architecture). This is the most current and flexible one.
EAI (Enterprise Application Integration). This was a hot concept in the late 1990s, but was generally implemented with difficulties that SOA was later designed to alleviate.
Message-oriented middleware (MOM) and Publish/Subscribe. These are even older, and overlap greatly.

Finally, every dashboard that combines information from different data stores could be assigned to this category as well.

Multiple storage approaches in a single DBMS

In theory, a single DBMS could operate like two or more different ones glued together. A few functions should or must be centralized, such as administration, and communication with the outside world (connection handling, parsing, etc.). But data storage, query execution and so on could for the most part be performed by rather loosely coupled subsystems. And so you might have the best of both worlds — something that’s multiple data stores in the ways you want that diversity, but a single system in how it fits into your environment.
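
The shape of that idea can be sketched in a few lines of Python: one front end owns the catalog and the interface to the outside world, while storage and scanning are delegated to interchangeable engines. Everything here (the class names `FrontEnd`, `RowEngine` and `ColumnEngine`, and the tiny `insert`/`select` interface) is hypothetical and deliberately toy-sized; it is not modeled on any particular product.

```python
class RowEngine:
    """Keeps each record together; cheap to insert and to fetch whole rows."""

    def __init__(self, schema):
        self.schema = list(schema)
        self.rows = []

    def insert(self, record):
        self.rows.append({name: record[name] for name in self.schema})

    def scan(self, columns):
        return [tuple(row[c] for c in columns) for row in self.rows]


class ColumnEngine:
    """Keeps each column together; cheap to scan a few columns of many rows."""

    def __init__(self, schema):
        self.columns = {name: [] for name in schema}

    def insert(self, record):
        for name, values in self.columns.items():
            values.append(record[name])

    def scan(self, columns):
        return list(zip(*(self.columns[c] for c in columns)))


class FrontEnd:
    """Centralized piece: one catalog and one interface for every engine."""

    ENGINES = {"row": RowEngine, "column": ColumnEngine}

    def __init__(self):
        self.catalog = {}

    def create_table(self, name, schema, engine):
        self.catalog[name] = self.ENGINES[engine](schema)

    def insert(self, table, record):
        self.catalog[table].insert(record)

    def select(self, table, columns):
        return self.catalog[table].scan(columns)


if __name__ == "__main__":
    db = FrontEnd()
    db.create_table("orders", ["id", "amount"], engine="row")
    db.create_table("orders_history", ["id", "amount"], engine="column")
    for record in ({"id": 1, "amount": 5}, {"id": 2, "amount": 7}):
        db.insert("orders", record)
        db.insert("orders_history", record)
    print(db.select("orders_history", ["amount"]))
```

The caller never sees which engine answered. What the sketch leaves out (a shared optimizer, transactions that span engines, unified administration) is exactly where real multi-engine products find the hard work.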

I discussed this idea last year with cautious optimism, writing:

So will these trends succeed? The foregoing caveats notwithstanding, my answers are more Yes than No.

… multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.

…

Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …

… strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM [have] been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.

In 2005 I had been more ambivalent, in part because my model was a full 1990s-dream “universal” DBMS:

IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML.

That covers most of the waterfront, but I’d like to more explicitly acknowledge three trends:

Among other things, Hadoop is a collection of DBMS (HBase, Impala, et al.) that in some cases are very loosely coupled to each other. The question is less how well the various data stores work together, and more how mature any one of them is on its own.
The multiple-data-models idea has been extended into schema-on-need, which is sometimes but not always housed in Hadoop.
Even on the relational side, multiple storage capabilities exist in one product.
- Vertica was designed that way from the get-go. (Like the old joke about police duos, one is to read and one is to write.)
- IBM, Microsoft and Oracle have all recently added some kind of in-memory columnar capability.
- Teradata, Aster (before Teradata bought them), Greenplum and Vertica all added some variant on row/column dual stores.
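
The "one is to read and one is to write" design can be illustrated with a toy Python class: inserts land in an unsorted row buffer, a background-style step periodically merges them into sorted columns, and queries consult both. This is only a sketch of the general pattern; the class `DualStore`, its `move_out` method and the `flush_threshold` parameter are invented for illustration and do not describe Vertica's or anybody else's actual implementation.

```python
import bisect


class DualStore:
    """Toy table with a write-optimized buffer and a read-optimized store."""

    def __init__(self, schema, sort_key, flush_threshold=4):
        self.sort_key = sort_key
        self.flush_threshold = flush_threshold
        self.write_buffer = []                           # rows, unsorted
        self.read_columns = {name: [] for name in schema}  # columns, sorted

    def insert(self, record):
        """Appending to the buffer is cheap; no sorting happens here."""
        self.write_buffer.append(dict(record))
        if len(self.write_buffer) >= self.flush_threshold:
            self.move_out()

    def _read_rows(self):
        """Rebuild row dicts from the columnar store (used only when merging)."""
        cols = self.read_columns
        count = len(cols[self.sort_key])
        return [{name: cols[name][i] for name in cols} for i in range(count)]

    def move_out(self):
        """Merge buffered rows into the sorted columnar store."""
        merged = sorted(
            self._read_rows() + self.write_buffer,
            key=lambda row: row[self.sort_key],
        )
        self.read_columns = {
            name: [row[name] for row in merged] for name in self.read_columns
        }
        self.write_buffer = []

    def range_query(self, low, high, column):
        """Return `column` for rows whose sort key lies in [low, high]."""
        keys = self.read_columns[self.sort_key]
        start = bisect.bisect_left(keys, low)
        stop = bisect.bisect_right(keys, high)
        results = self.read_columns[column][start:stop]
        # Rows not yet moved out are visible only in the buffer.
        results += [
            row[column]
            for row in self.write_buffer
            if low <= row[self.sort_key] <= high
        ]
        return results
```

A full rewrite on every merge is obviously naive; the point is only that the write path and the read path can use different physical layouts while presenting one table.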

Related links

SQL vs. NoSQL, legacy vs. clean-up. (March, 2014)
The difficulty of DBMS development, including Hadoop-based ones (March, 2013)

NoSQL vs. NewSQL vs. traditional RDBMS

Curt Monash — Fri, 28 Mar 2014 14:09:37 +0000

I frequently am asked questions that boil down to:

When should one use NoSQL?
When should one use a new SQL product (NewSQL or otherwise)?
When should one use a traditional RDBMS (most likely Oracle, DB2, or SQL Server)?

The details vary with context — e.g. sometimes MySQL is a traditional RDBMS and sometimes it is a new kid — but the general class of questions keeps coming. And that’s just for short-request use cases; similar questions for analytic systems arise even more often.

My general answers start:

Sometimes something isn’t broken, and doesn’t need fixing.
Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.

In particular, migration away from legacy DBMS raises many issues:

Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
Your staff’s programming and administrative skill-sets.
Your investment in DBMS-related tools.
Your supply of hockey tickets from the vendor’s salesman.

Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.

Commonly, the good reason to change DBMS is one or more of:

Programming model. Increasingly often, dynamic schemas seem preferable to fixed ones. Internet-tracking nested data structures are just one of the reasons.
Performance (scale-out). DBMS written in this century often scale out better than ones written in the previous millennium. Also, DBMS with fewer features find it easier to scale than more complex ones; distributed join performance is a particular challenge.
Geo-distribution. A special kind of scale-out is geo-distribution, which is sometimes a compliance requirement, and in other cases can be a response-time nice-to-have.
Other stack choices. Couchbase gets a lot of its adoption from existing memcached users (although they like to point out that the percentage keeps dropping). HBase gets a lot of its adoption as a Hadoop add-on.
Licensing cost. Duh.

NoSQL products commonly make sense for new applications. NewSQL products, to date, have had a harder time crossing that bar. The chief reasons for the difference are, I think:

Programming model!
Earlier to do a good and differentiated job in scale-out.
Earlier to be at least somewhat mature.

And that brings us to the 762-gigabyte gorilla — in-memory DBMS performance — which is getting all sorts of SAP-driven marketing attention as a potential reason to switch. One can of course put any database in memory, providing only that it is small enough to fit in a single server’s RAM, or else that the DBMS managing it knows how to scale out. Still, there’s a genuine category of “in-memory DBMS/in-memory DBMS features”, principally because:

In-memory database managers can and should have a very different approach to locking and latching than ones that rely on persistent storage.
Not all DBMS are great at scale-out.

But Microsoft has now launched Hekaton, about which I long ago wrote:

I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned were the absence of locking and latching.

My level of knowledge about Hekaton hasn’t improved in the interim; still, it would seem that in-memory short-request database management is not a reason to switch away from Microsoft SQL Server. Oracle has vaguely promised to get to a similar state one of these years as well.

Of course, HANA isn’t really a short-request DBMS; it’s an analytic DBMS that SAP plausibly claims is sufficiently fast and feature-rich for short-request processing as well.* It remains to be seen whether that difference in attitude will drive enough sustainable product advantages to make switching make sense.

*Most obviously, HANA is columnar. And it has various kinds of integrated analytics as well.

Related links

Wants vs. needs (March, 2014)
The refactoring of everything (July, 2013)
Notes on memory-centric data management (January, 2014)
Traditional databases will eventually wind up in RAM (May, 2011)
Coverage of memory-centric DBMS flag-wavers MemSQL, Aerospike, and SAP HANA

RDBMS and their bundle-mates

Curt Monash — Sun, 10 Nov 2013 19:22:48 +0000

Relational DBMS used to be fairly straightforward product suites, which boiled down to:

A big SQL interpreter.
A bunch of administrative and operational tools.
Some very optional add-ons, often including an application development tool.

Now, however, most RDBMS are sold as part of something bigger.

Oracle has hugely thickened its stack, as part of an Innovator’s Solution strategy — hardware, middleware, applications, business intelligence, and more.
IBM has moved aggressively to a bundled “appliance” strategy. Even before that, IBM DB2 long sold much better to committed IBM accounts than as a software-only offering.
Microsoft SQL Server is part of a stack, starting with the Windows operating system.
Sybase was an exception to this rule, with thin(ner) stacks for both Adaptive Server Enterprise and Sybase IQ. But Sybase is now owned by SAP, and increasingly integrated as a business with …
… SAP HANA, which is closely associated with SAP’s applications.
Teradata has always been a hardware/software vendor. The most successful of its analytic DBMS rivals, in some order, are:
- Netezza, a pure appliance vendor, now part of IBM.
- Greenplum, an appliance-mainly vendor for most (not all) of its existence, and in particular now as a part of EMC Pivotal.
- Vertica, more of a software-only vendor than the others, but now owned by and increasingly mainstreamed into hardware vendor HP.
MySQL’s glory years were as part of the “LAMP” stack.
Various thin-stack RDBMS that once were or could have been important market players … aren’t. Examples include Progress OpenEdge, IBM Informix, and the various strays adopted by Actian.

This phenomenon is, I think, much more driven by vendors than users. Most of the examples I listed work or could work perfectly well on their own.* But relational database management systems are seen as “strategic” products, which means in particular:

They’re often expensive to adopt (software, hardware, people costs).
They’re also often expensive to switch away from.

And strategic products, high price tags, and thick product stacks commonly go together.

*Netezza is an exception. But Exadata is not; while Oracle data warehousing was in a bad technical place before Exadata, Exadata software is what cleaned the problem up.

Also relevant is that I took those examples from relatively mature RDBMS market segments — high-end OLTP/general-purpose (OnLine Transaction Processing), mid-range OLTP/general-purpose, and analytic. Products in those sectors have had enough time to be built out. They also tend to have fairly close competitors, as the most important product features (e.g. columnar storage in analytic RDBMS, or online backup across the board) have been imitated numerous times each.

NewSQL, by way of contrast, is just as thin-stack as NoSQL is. Products in those sectors are immature; vendors are completing them first before wedding them to other technology layers. They’re also strongly differentiated; if you tell me what topology you need and which style(s) of API or DML (Data Manipulation Language) you prefer, the list of product candidates I give you may be short indeed.

HBase is the obvious exception to my “NoSQL products stand alone” generalization, but its market position is a matter of debate.

I have mixed feelings about this trend. For starters, I’m grudgingly becoming more sympathetic to DBMS/hardware bundles, notwithstanding their role as a way to gouge more money from customers than the hardware is actually worth. Why? Because of my opinion that there’s a general move toward appliances, clusters and clouds. In particular:

As DBMS become better at straddling and melding RAM, flash and disk, legitimate reasons to optimize hardware/software integration will increase.
Microsoft (with Parallel Data Warehouse) and SAP (with HANA) induce customers to adopt hardware “appliances” even though they don’t sell and profit from the hardware themselves. This shoots down the argument that appliances are only a vendor trick to squeeze out more profits.
Netezza’s super-easy installation was a really nice feature.

When it comes to RDBMS/business intelligence bundles, my thoughts start:

As a general rule, a benefit of BI is that it can get at data from lots of different sources. This speaks against tying it to a specific DBMS.
The vendor-specific evidence is mixed:
- IBM has never explained any user advantages to including Cognos in its analytic “appliance” product lines.
- Teradata did some special optimizations for MicroStrategy. This suggests that, conversely, MicroStrategy could benefit from DBMS-specific features.
- QlikView built a custom in-memory data store.
- Specialized business intelligence stacks are on the rise, although generally with a beyond-just-relational flavor.

And so I’m skeptical about RDBMS/BI integration, but willing to be persuaded otherwise.

The integration of advanced analytics with RDBMS leaves me perplexed. Gains in performance, scalability and/or development ease would seem, in many cases, too great to pass up. (E.g., the Teradata Aster 6 story, analytic libraries and all.) And indeed most analytic platform vendors report some level of adoption. But the whole thing is moving more slowly than I expected. Meanwhile in the Hadoop world, a much lesser SQL capability — Hive — seems to be integrated into other analytic processing with enthusiasm. Perhaps the problem is that enterprises have to figure out which analytic techniques to use in the first place, before they worry too much about making them efficient.

And finally, when it comes to bundling of packaged applications with RDBMS — that depends on the class of application.

At the high end, it’s almost purely a pricing ploy, as those apps are usually written for lowest-common-denominator SQL functionality, so as to preserve portability.
A lot of mid-range apps are written against a specific DBMS, which is then resold along with the app. What’s more …
… most of those apps will migrate over time to a SaaS (Software as a Service) delivery model, which allows for a wholly integrated stack. And as the Workday example teaches us, database choices for SaaS apps can be pretty imaginative.

Related links

The refactoring of everything (July, 2013)
Comments about Gartner’s comments about a bunch of DBMS products (November, 2013)
The cardinal rules of DBMS development (March, 2013)

Comments on the 2013 Gartner Magic Quadrant for Operational Database Management Systems

Curt Monash — Fri, 08 Nov 2013 16:46:46 +0000

The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:

I admire the raw research.
The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
Some of the details are questionable.
There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)

Anyhow:

The 2013 Gartner Magic Quadrant for Operational Database Management Systems puts Oracle in the lead, closely followed in some order by Microsoft, SAP, and IBM, with everybody else way behind. That’s reasonable, harkening back to the time when Oracle, IBM, Microsoft and to some extent Sybase were seemingly secure oligopolists, and most of the other vendors mentioned didn’t yet exist.
Gartner seems to view a proprietary appliance strategy as good for customers, without mentioning that it’s also a way to sell hardware at ridiculous prices.
Gartner evidently likes memory-centric positioning. SAP, Aerospike, VoltDB and McObject all get surprisingly high marks.
Gartner gives Intersystems pretty high marks, while Progress Software isn’t even mentioned. Despite Progress’ recent restructuring, I’d think the core Progress OpenEdge business — arguably Intersystems’ closest rival — deserves more respect than that. (But given how rarely I write about it myself, perhaps I shouldn’t criticize.)
Gartner has long been oddly positive on Actian, which is a floundering hodgepodge of half a dozen database also-rans. I like Mike Hoskins a lot too, but just how much has Actian’s supposedly “energized” “strong leadership” accomplished in the recent past, at Actian or elsewhere?
Gartner has brutally low “vision” rankings for NuoDB and Clustrix. I think scaling out SQL effectively is more impressive than that. Gartner also omits to mention Clustrix’s past as an appliance vendor.
Gartner refers to Oracle’s multi-tenancy support as if … well, as if it supported multi-tenancy.
I don’t understand Gartner’s rankings of or comments about NoSQL vendors. For example:
- Three “strengths” are mentioned for MongoDB, yet none reference MongoDB’s developer outreach, which may be second only to prime Microsoft’s.
- HBase is discussed as if the Hadoop vendors were still pushing it hard, or as if it were showing up in a lot of enterprise evaluations.
- Geo-distribution is mentioned as a strength for Riak, yet not for Cassandra.
Every Gartner Magic Quadrant (or Forrester Wave) features one or more outright brain cramps. In this one:
- Gartner writes “the Clustrix database offers no support for data types beyond traditional relational types,” when in fact Clustrix was one of the early indicators of a trend toward relational DBMS JSON support.
- Gartner suggests that EnterpriseDB’s Oracle compatibility is something new, when it was actually the company’s whole strategy 6-7 years ago.

Finally, since I’ve struggled with the definition of “DBMS”, I’ll finish by quoting with approval the start of Gartner’s:

We define a DBMS as a complete software system used to define, create, manage, update and query a database.

Related links

Comments on the most recent Gartner Magic Quadrant for Data Warehouse Database Management Systems
My definition of operational analytics

Thoughts on in-memory columnar add-ons

Curt Monash — Mon, 23 Sep 2013 13:24:42 +0000

Oracle announced its in-memory columnar option Sunday. As usual, I wasn’t briefed; still, I have some observations. For starters:

Oracle, IBM (Edit: See the rebuttal comment below), and Microsoft are all doing something similar …
… because it makes sense.
The basic idea is to take the technology that manages indexes — which are basically columns+pointers — and massage it into an actual column store. However …
… the devil is in the details. See, for example, my May post on IBM’s version, called BLU, outlining all the engineering IBM did around that feature.
Notwithstanding certain merits of this approach, I don’t believe in complete alternatives to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that point.
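
The "indexes are basically columns+pointers" remark can be shown with a small Python sketch. A secondary index is modeled as (value, row pointer) pairs sorted by value; re-sorting the same pairs by pointer makes the pointer implicit in the array position, which leaves a bare column. The function names `build_index` and `index_to_column` are made up for this illustration, and the sketch says nothing about how Oracle, IBM or Microsoft actually engineer the feature.

```python
def build_index(rows, column):
    """Classic secondary index: (value, row pointer) pairs sorted by value."""
    return sorted((row[column], row_id) for row_id, row in enumerate(rows))


def index_to_column(index):
    """Re-sort the pairs by row pointer and keep only the values.

    Once entries are in row order the pointer equals the array position,
    so it no longer needs to be stored.
    """
    return [value for value, _row_id in sorted(index, key=lambda pair: pair[1])]


if __name__ == "__main__":
    rows = [
        {"id": 1, "amount": 30},
        {"id": 2, "amount": 10},
        {"id": 3, "amount": 20},
    ]
    amount_index = build_index(rows, "amount")
    amount_column = index_to_column(amount_index)
    # An aggregate can now be computed from the column alone,
    # without touching the row store.
    print(amount_index, amount_column, sum(amount_column))
```

The index ordering favors lookups by value, while the column ordering favors scans and aggregates; the details the post alludes to (compression, vectorized execution, keeping the column in step with updates) are all outside this sketch.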

I’d also add that Larry Ellison’s pitch “build columns to avoid all that index messiness” sounds like 80% bunk. The physical overhead should be at least as bad, and the main saving in administrative overhead should be that, in effect, you’re indexing ALL columns rather than picking and choosing.

Anyhow, this technology should be viewed as applying to traditional business transaction data, much more than to — for example — web interaction logs, or other machine-generated data. My thoughts around that distinction start:

I argued back in 2011 that traditional databases will wind up in RAM, basically because …
… Moore’s Law will make it ever cheaper to store them there.
Still, cheaper != cheap, so this is a technology only to use with your most valuable data — i.e., that transactional stuff.
These are very tabular technologies, without much in the way of multi-structured data support.

But in a bit of evidence that disconfirms my case, one of the first SAP applications to require HANA was something called “Smart Meter Analytics”.

To see more specifically where this technology could be useful, let’s map it against my 2011 analytic database taxonomy.

If you’re managing a partial EDW (Enterprise Data Warehouse) on the same technology as your OLTP (OnLine Transaction Processing) databases, but are running out of steam, in-memory columnar could provide some acceleration.
Traditional data marts are somewhat obsolete, and establishing a new one would be mainly a cost play. So the fit is questionable.
Investigative data marts could be a good fit, but only if you’re fairly unimaginative as to the kinds of data you want to include.
Several other categories are no fit at all.
There’s a good fit for certain kinds of operational analytics.

I’ll finish by expanding on that last point.

Operational applications have always had analytics blended in. If nothing else, there were a lot of straight reports; sometimes there’s a bit of optimization as well. Workday, for example, has BI and search as two of its core OLTP UI metaphors, and has a lot of other BI snippets called worklets as well. (And by the way, a lot of Workday’s database is in-memory.) I’ve thought for years that operational/analytic blending would be a major area of competition between Oracle and SAP; hence — I believe — SAP’s acquisitions of Business Objects and KXEN. Columnar in-memory Oracle features, and similarly SAP HANA, seem well-suited to support such application elements.