PostgreSQL – DBMS 2 : DataBase Management System Services

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line, SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations.
SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.
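The logical-to-physical partitioning scheme in the list above can be sketched roughly as follows. This is a minimal illustration, not SequoiaDB's actual code: the hash function, the round-robin initial layout, and the names `logical_partition`, `assign` and `rebalance` are all assumptions made for the example. Only the figure of 4096 logical partitions comes from the post.

```python
from collections import Counter
import zlib

NUM_LOGICAL = 4096  # the "typically 4096" figure; everything else is illustrative


def logical_partition(key: str) -> int:
    """Hash a record key to a logical partition. This never changes with cluster size."""
    return zlib.crc32(key.encode("utf-8")) % NUM_LOGICAL


def assign(num_physical: int) -> list:
    """Initial layout: logical partition i lives on physical partition i % num_physical."""
    return [lp % num_physical for lp in range(NUM_LOGICAL)]


def rebalance(assignment: list, num_physical: int) -> list:
    """Reassign logical partitions for a new physical count, moving only what is needed."""
    new = list(assignment)
    target = -(-len(new) // num_physical)  # ceiling of the average load
    load = Counter(new)
    for lp, node in enumerate(new):
        # Move a partition if its node was removed or is holding too many.
        if node >= num_physical or load[node] > target:
            dest = min(range(num_physical), key=lambda n: load[n])
            load[node] -= 1
            load[dest] += 1
            new[lp] = dest
    return new


old = assign(4)
new = rebalance(old, 5)
moved = sum(1 for a, b in zip(old, new) if a != b)
print(f"{moved} of {NUM_LOGICAL} logical partitions moved")
```

The point of the indirection is that a key's logical partition is fixed, so adding a node only relocates a fraction of the logical partitions rather than rehashing every record.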

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.
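To make the "each record goes in a separate JSON document" point concrete, here is a small sketch of the kind of translation a foreign data wrapper has to do between rows and documents. The function names are hypothetical, and this is not SequoiaDB's wrapper code; it only illustrates the row-per-document idea.

```python
import json


def row_to_document(columns, row):
    """Serialize one relational row as one JSON document, keyed by column name."""
    return json.dumps(dict(zip(columns, row)), sort_keys=True)


def document_to_row(columns, document):
    """Project a JSON document onto a fixed column list.

    Fields the document lacks come back as None, which plays the role of SQL NULL.
    """
    fields = json.loads(document)
    return tuple(fields.get(c) for c in columns)


columns = ("id", "owner", "balance")
doc = row_to_document(columns, (1, "Li", 100.0))
print(doc)
print(document_to_row(columns, doc))
```

Because the store is schemaless, the projection step is where a relational front end would impose its column list on whatever the document happens to contain.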

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change – well, a view can be established to allow for that.
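A rough sketch of that view pattern follows. It uses Python's built-in SQLite purely as a stand-in so that it runs anywhere; the table names, columns and data are invented, and SequoiaDB's PostgreSQL front end may express this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Version 1 of the source schema.
    CREATE TABLE accounts_v1 (id INTEGER, owner TEXT, balance REAL);
    -- The source schema gained a column, so copies now land in a new table.
    CREATE TABLE accounts_v2 (id INTEGER, owner TEXT, balance REAL, branch TEXT);

    INSERT INTO accounts_v1 VALUES (1, 'Li', 100.0);
    INSERT INTO accounts_v2 VALUES (2, 'Wang', 250.0, 'Shenzhen');

    -- The view exposes only the columns that did not change across versions.
    CREATE VIEW accounts_stable AS
        SELECT id, owner, balance FROM accounts_v1
        UNION ALL
        SELECT id, owner, balance FROM accounts_v2;
""")

rows = conn.execute(
    "SELECT id, owner, balance FROM accounts_stable ORDER BY id"
).fetchall()
print(rows)
```

Queries that only care about the stable columns can target the view and ignore how many schema versions have accumulated underneath it.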

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Thoughts and notes, Thanksgiving weekend 2014

Curt Monash — Mon, 01 Dec 2014 01:48:43 +0000

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:

Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.

4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.

WiredTiger has the same techie principals as Sleepycat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.

I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.

5. Hadoop’s traditional data distribution story goes something like:

Data lives on every non-special Hadoop node that does processing.
This gives the advantage of parallel data scans.
Sometimes data locality works well; sometimes it doesn’t.
Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
… but Hadoop is getting away from that kind of strict, I/O-intensive processing model.

However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership to work with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.

6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.

7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.

8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:

IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
… upstarts have three behemoths to outdo, not just one.
MySQL, PostgreSQL and to some extent Sybase are still around as well.

Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:

General-purpose RDBMS.
Analytic RDBMS.
NoSQL.
Non-relational analytic data stores (perhaps Hadoop-based).

it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.

All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.

9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.

21st Century DBMS success and failure

Curt Monash — Mon, 14 Jul 2014 06:37:31 +0000

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

A significant minority of enterprises are highly committed to their enterprise DBMS standards.
Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say:

At large enterprises, their internet operations perhaps excepted:
- Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
- NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

Performance and/or scalability (many examples).
Developer features (for example dynamic schema).
License/maintenance cost (for example several open source categories).
Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
… appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

Reliability, especially but not only in short-request use cases.
Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
- I think this is one of several reasons NoSQL has been much more successful than NewSQL.
- It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
As for MySQL:
- MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
- MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
- In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
Any other new short-request SQL DBMS that sounds like it has traction is also memory-centric.
Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one. A couple that seem particularly germane and didn’t already get linked above are:

The drive for uninterrupted DBMS operation.
Short-request DBMS trade-offs and alternatives.

Notes and comments, May 6, 2014

Curt Monash — Tue, 06 May 2014 13:46:54 +0000

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
My Cloudera/SQL/Impala/Hive post apparently was well-balanced, in that it got attacked from multiple sides via Twitter and email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
My CitusDB post picked up a few clarifying comments.

Here is a catch-all post to complete the set.

1. The recently-announced Cloudera/MongoDB relationship* is still at the Barney stage. That said, I’m optimistic that their stated intention to add substance to the relationship will eventually come to fruition. If nothing else, the two companies have high regard for each other, at least at the Mike Olson/Max Schireson level.

*That’s one of numerous deals with my fingerprints on it, but in this case only lightly. It was probably on track to happen even without my nudges.

2. Most of what I talked about when I visited MongoDB is confidential; the public stuff was mainly in my recent MongoDB technology post. But in one exception, I asked Max for an update as to MongoDB enterprise use cases. He reported a cluster in data combination, especially but not only in use cases which have both a high-volume part and dynamic-schema aspects. Specific examples Max cited included:

Tracking financial holdings from a variety of asset classes — especially if derivatives are involved, because they have a dynamic-schema aspect.
Product catalogs, including for use on web sites.
Customer information.
Patient information.

3. I didn’t ask everybody I saw in California about business trends, and much of what we did discuss was confidential. That said:

MapR was proud of its numbers.
So was DataStax.
ClearStory has a bunch of Very Big Enterprises as customers, mainly but not only in consumer sectors (e.g. retail, packaged goods).

4. Platfora is focusing a bit, starting with clickstream and security — i.e., event series stuff. And by the way, they report that the term “event series” is working well for them.

5. I gather from a variety of comments and conversations that Amazon Redshift has achieved considerable traction.

6. Something I can’t find evidence of having posted before: I think multiple businesses monitor online sales or similar business successes as a guide to network problems. eBay did this via a custom in-memory MOLAP (Multidimensional OnLine Analytic Processing) system years ago. Best evidence that this is hardly restricted to eBay: all the “me-too” responses I get from telling that story.

7. Citus Data tells me that as of PostgreSQL 9.4, Postgres will be able to return just the part of a JSON column needed for a query. This is as opposed to storing the whole thing as text and only retrieving it in its entirety.

8. In the comments to my “Spark on fire” post, Patrick McFadin pointed out that Mahout is transitioning from MapReduce to Spark. (All new work will be on Spark, although old MapReduce-based routines will continue to be supported.) It turns out that Derrick Harris wrote about that over a month ago, and I just missed the news.

9. Also in predictive analytics — there are rumblings that R could eventually be supplanted by Julia, although R’s massive libraries of algorithms still give it the advantage now.

10. Multiple vendors, fed up with the intermittent slowdowns from garbage collection, are moving some processing off the Java heap. Unfortunately, I neglected to ask any of them what the remaining differences then were between Java and C++ programming.

11. And to finish on a light note: BDAS — the project of which Spark is only a part — is pronounced “bad-ass”, something I first heard from Dave Patterson.

Introduction to CitusDB

Curt Monash — Fri, 02 May 2014 08:00:08 +0000

One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:

CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
CitusDB follows the modern best-practice of having many virtual nodes on each physical node. Default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.

*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.

Citus has thrown a few things against the wall; for example, there are two versions of its product, one of which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope – at least until the product is more built-out – is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.

Notwithstanding what I said about “fat heads”, CitusDB does have a concept of Master nodes. These:

Also use single-node copies of PostgreSQL.
Are blessedly able to scale out, although their underlying databases are entirely replicated.
Store no actual data, but do store metadata about each virtual node, including:
- Structural metadata.
- Location.
- Min/max column values (for data skipping).
- But not (yet) stats to help with query optimization.
Do some query planning and rewriting.
Handle administration, some of which is nicely parallelized/centralized. (E.g., an index choice can be made once and automatically propagated across all the relevant virtual nodes.)
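The min/max metadata mentioned above is what makes data skipping possible: a master can rule out virtual nodes whose value range cannot match the query before any worker is contacted. A minimal sketch, assuming a hypothetical `ShardMeta` record and made-up node names and date values (this is not CitusDB's actual catalog format):

```python
from dataclasses import dataclass


@dataclass
class ShardMeta:
    """Per-virtual-node metadata of the kind a master node might keep."""
    shard_id: int
    node: str        # location of the virtual node
    min_value: int   # smallest value of the filtered column in this shard
    max_value: int   # largest value of the filtered column in this shard


def shards_to_scan(catalog, low, high):
    """Keep only shards whose [min, max] range overlaps the query range [low, high]."""
    return [s for s in catalog if s.max_value >= low and s.min_value <= high]


catalog = [
    ShardMeta(1, "worker-a", 20140101, 20140131),
    ShardMeta(2, "worker-b", 20140201, 20140228),
    ShardMeta(3, "worker-a", 20140301, 20140331),
]

# A query for dates between 2014-02-15 and 2014-03-10 never touches shard 1.
hits = shards_to_scan(catalog, 20140215, 20140310)
print([s.shard_id for s in hits])
```

Skipping works best when data arrives roughly ordered on the filtered column, as with time-series loads, since that keeps each shard's range narrow.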

CitusDB is definitely in its early days. For example:

If I understand correctly, the recent CitusDB 3.0 release is the first one on which data is redistributed among shards. Before that, you could only join tables that were either sharded on the same key, or else small enough to be broadcast-replicated across the whole cluster.
SQL coverage isn’t great. (E.g., no Windowing.)
Some hard-to-parallelize things aren’t implemented yet, e.g. exact median or generally-usable COUNT DISTINCT.
ACID is still lacking. Writes are batch-only, micro-batch or otherwise as the case may be.
CitusDB’s backup story is primitive, with the main options being:
- You can rely on having replicas on multiple nodes, even — if you like — in different data centers.
- You can back up each of the PostgreSQL nodes separately; CitusDB doesn’t yet offer automation for that.
CitusDB’s query optimization sounds pretty primitive.
I don’t recall Citus telling me of serious workload management.
CitusDB compression is block-level only. (PostgreSQL’s version of Lempel-Ziv.)
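To see why COUNT DISTINCT counts as hard to parallelize, compare it with a plain COUNT, where per-shard results simply add up. The toy sketch below uses invented data and function names, with plain Python lists standing in for shards:

```python
# Two shards of a "visits" table, holding user ids. Users u2 and u3 appear in both.
shard_a = ["u1", "u2", "u3"]
shard_b = ["u2", "u3", "u4"]


def naive_count_distinct(shards):
    """Wrong in general: sums per-shard distinct counts, double-counting shared values."""
    return sum(len(set(shard)) for shard in shards)


def exact_count_distinct(shards):
    """Correct, but each shard must ship its full set of distinct values for merging."""
    merged = set()
    for shard in shards:
        merged |= set(shard)
    return len(merged)


print(naive_count_distinct([shard_a, shard_b]))
print(exact_count_distinct([shard_a, shard_b]))
```

The naive version is only right when the data happens to be sharded on the counted column. Otherwise the exact answer needs data movement proportional to the number of distinct values, and an exact median has a similar problem, since per-shard medians cannot be combined into a global one.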

Still, the Citus Data folks seem to have good ideas, including some — as yet undisclosed — plans going forward. So if it sounds as if CitusDB might fit your needs better than more established scale-out RDBMS do, I’d encourage you to take a look at what Citus offers.

Notes and comments, March 17, 2014

Curt Monash — Mon, 17 Mar 2014 07:09:15 +0000

I have ever more business-advice posts up on Strategic Messaging. Recent subjects include pricing and stealth-mode marketing. Other stuff I’ve been up to includes:

The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.

Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing.

The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.

Basho had a major change in leadership. A Twitter exchange ensued. Joab Jackson offered a more sober — figuratively and literally — take.

Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.

*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.

WibiData is partnering with DataStax. WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.

Disclosure: My fingerprints are all over that deal.

In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.

I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing, and it seems to be popular. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
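One simple way to picture "how much you invest in testing unproven alternatives" is the epsilon-greedy strategy sketched below. It is only one of several bandit algorithms, and nothing here says it is the one the vendors in question use. The class name, the conversion rates and the 10% exploration share are illustrative assumptions.

```python
import random


class EpsilonGreedy:
    """Spend a fixed share of traffic exploring; send the rest to the current best arm."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # times each alternative was shown
        self.values = [0.0] * n_arms    # running mean reward per alternative
        self.rng = random.Random(seed)

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore at random
        # Exploit: pick the arm with the best observed mean (ties go to the lowest index).
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(true_rates, rounds, epsilon=0.1, seed=0):
    """Run the bandit against made-up conversion rates and return it for inspection."""
    bandit = EpsilonGreedy(len(true_rates), epsilon, seed)
    world = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.choose()
        reward = 1.0 if world.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


bandit = simulate(true_rates=[0.05, 0.10], rounds=5000)
print(bandit.counts, [round(v, 3) for v in bandit.values])
```

In a classic A/B test the traffic split is fixed in advance. Here `epsilon` is the dial for how much traffic keeps going to alternatives that currently look worse.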

I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.

I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.

*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.

Distinctions in SQL/Hadoop integration

Curt Monash — Sun, 09 Feb 2014 18:50:38 +0000

Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):

Are the SQL engine and Hadoop:
- Necessarily on the same cluster?
- Necessarily or at least most naturally on different clusters?
How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
- HDFS (Hadoop Distributed File System)?
- Hadoop MapReduce?
- HCatalog?
How, if at all, is the SQL engine invoked by Hadoop?

In particular:

If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
- A way of making data transfer maximally parallel.
- Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.

Let’s go to some examples.

Hive is the closest known example of SQL/Hadoop integration. Hive executes a somewhat low-grade dialect of SQL, HQL (Hive Query Language), via very standard Hadoop: Hadoop MapReduce, all HDFS file formats, etc. HCatalog is an enhancement/replacement for the Hive metadata store. HQL is just another language that can be used to write (parts of) Hadoop jobs.

Impala is Cloudera’s replacement for Hive. Impala is and/or is planned to be much like Hive, but much better, for example in performance and in SQL functionality. Impala has its own custom execution engine, including a daemon on every Hadoop data node, and seems to run against a variety of but not all HDFS file formats.

Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez.

Teradata SQL-H is an RDBMS-Hadoop connector that uses HCatalog, and plans queries across the two clusters. Microsoft Polybase is like SQL-H, but it seems more willing than Teradata or Teradata Aster to (optionally) coexist on the same nodes as Hadoop.

Hadapt runs on the Hadoop cluster, putting PostgreSQL* and other software on each Hadoop data node. It has two query engines, one that invokes Hadoop MapReduce (the original one, still best for longer-running queries) and one that doesn’t (more analogous to Impala). When last I looked, Hadapt didn’t query or update against the HDFS API, but there was an interesting future in preloading data from HDFS into Hadapt PostgreSQL tables, and I think that Hadapt’s PostgreSQL tables are technically HDFS files. I don’t think Hadapt makes much use of HCatalog.

*Hacked to allow Hadapt to offer more than just SQL/Hadoop integration.

Splice Machine is a new entrant (public beta is imminent) that has put Apache Derby over an HBase back end. (Apache Derby is the former Cloudscape, an embeddable Java RDBMS that was acquired by Informix and hence later by IBM.) Splice Machine runs on your Hadoop nodes as an HBase coprocessor. Its relationship to non-HBase parts of Hadoop is arm’s-length. I wish this weren’t called “SQL-on-Hadoop”.

Related links

Dan Abadi and Dave DeWitt opined last June about how to categorize Hadapt and Polybase.
My most detailed discussions of Impala and Stinger were last June and August, respectively.

Schema-on-need

Curt Monash — Sun, 22 Sep 2013 00:23:49 +0000

Two years ago I wrote about how Zynga managed analytic data:

Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay’s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.

What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:

Relational DBMS are adding or enhancing their support for complex datatypes, to accommodate various kinds of machine-generated data.
- MongoDB-compatible JSON is the flavor of the day on the short-request side, but alternatives include other JSON, XML, other key-value, or text strings.
- It is often possible to index on individual attributes inside the complex datatype.
The individual attributes inside the complex datatypes amount to virtual columns, which can play similar roles in SQL statements as physical columns do.
Over time, the DBA may choose to materialize virtual columns as additional physical columns, to boost query performance.

That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done.
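The virtual-to-physical migration can be sketched in a few lines. This is a toy in-memory model, not any vendor's implementation; the class name `SchemaOnNeedTable` and its methods are hypothetical and exist only for illustration.

```python
class SchemaOnNeedTable:
    """Rows hold a few physical columns plus a catch-all dict of key-value pairs."""

    def __init__(self, physical_columns):
        self.physical_columns = set(physical_columns)
        self.rows = []  # each row: {"physical": {...}, "extras": {...}}

    def insert(self, record):
        physical = {k: v for k, v in record.items() if k in self.physical_columns}
        extras = {k: v for k, v in record.items() if k not in self.physical_columns}
        self.rows.append({"physical": physical, "extras": extras})

    def select(self, column):
        # A virtual column is read out of the catch-all; a physical one directly.
        source = "physical" if column in self.physical_columns else "extras"
        return [row[source].get(column) for row in self.rows]

    def materialize(self, column):
        # Schema-on-need: promote a virtual column to a physical one.
        if column in self.physical_columns:
            return
        for row in self.rows:
            if column in row["extras"]:
                row["physical"][column] = row["extras"].pop(column)
        self.physical_columns.add(column)

table = SchemaOnNeedTable(["user_id"])
table.insert({"user_id": 1, "level": 7})
table.insert({"user_id": 2, "device": "ios"})
before = table.select("level")   # served from the key-value catch-all
table.materialize("level")       # the DBA decides the column is needed
after = table.select("level")    # same answer, now from a physical column
```

The point of the sketch is that `select` gives the same answer before and after `materialize`; only where the data lives, and hence the read and write cost, changes.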

For years people have been putting data into DBMS (usually but not exclusively relational ones), building some indexes immediately, then adding more indexes to improve performance later as requirements are discovered. Materialized views play a similar role. Schema-on-need is a continuation of the same idea, but targeted at poly-structured data.

Why not just materialize all possible columns immediately? There are two main reasons:

The result would be unwieldy and sparse. In some nested data structure cases, it seems that billions of columns could be needed. (I haven’t checked that estimate on my own, but it sounds plausible.)
While materializing a column helps query performance, it slows down writes. (Indexing generally has the same tradeoffs.)

Hadapt’s approach to schema-on-need

Hadapt has a particularly interesting approach to schema-on-need. As you may recall, Hadapt adds software to an ordinary Hadoop cluster so that it also functions as a decent analytic DBMS. One of those additions is a copy of PostgreSQL on every node, and in its next patch, Hadapt will incorporate modifications to PostgreSQL that are designed to support schema-on-need. More precisely, Hadapt will offer transparent support for physical schema-on-need, so that logically you can have a fairly tabular schema in place from the get-go.

The key elements of Hadapt’s approach are:

A custom datatype to accommodate and serialize sets of key-value pairs. Each key equates to a logical column.
A redirector that points query execution steps at physical or logical columns as the case may be.

Indeed, a Hadapt column can even be part physical, part logical. The main use for this is in cases when most but not all values in a column have the same datatype. The data of majority datatype then goes into the physical column, while the rest is stored in the catch-all key-value column.
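As a rough illustration of a part-physical, part-logical column, the sketch below splits values by majority datatype and then reads them back through a simple redirector. The function names are hypothetical, and nothing here reflects Hadapt's actual code.

```python
from collections import Counter

def split_by_majority_type(values):
    """Return (physical, overflow), each keyed by row index. Assumes `values` is non-empty."""
    majority = Counter(type(v) for v in values).most_common(1)[0][0]
    physical = {i: v for i, v in enumerate(values) if type(v) is majority}
    overflow = {i: v for i, v in enumerate(values) if type(v) is not majority}
    return physical, overflow

def read_column(physical, overflow, n_rows):
    # Redirector: try the physical column first, then the key-value catch-all.
    return [physical[i] if i in physical else overflow.get(i) for i in range(n_rows)]

raw = [10, 20, "n/a", 30]
physical, overflow = split_by_majority_type(raw)
restored = read_column(physical, overflow, len(raw))
```

Here the three integers would land in the typed physical column, while the stray string stays in the catch-all.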

In a naive version of this strategy, queries would involve retrieving a lot of rows that each have a wide field, then scanning the wide field to extract a particular keyed value. That could be slow and tedious. Hadapt’s hacks to mitigate that performance problem include:

The keys are encoded to be fixed-length and are stored together at the start of the field. Hadapt says this allows for fast binary search.
Stored along with the keys are offsets to indicate where the (variable-length) values can be found.
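The two mitigations above can be sketched as a byte layout with a sorted, fixed-length key directory in front of the variable-length values. The specific layout here (8-byte padded keys, 32-bit offsets and lengths, a leading count) is an assumption made for illustration; Hadapt's real encoding is not described in that detail.

```python
import struct

KEY_LEN = 8                               # hypothetical fixed key width, in bytes
ENTRY = struct.Struct(f"<{KEY_LEN}sII")   # key, value offset, value length
COUNT = struct.Struct("<I")               # number of directory entries

def pack_field(pairs):
    """Serialize {str: bytes} with a sorted, fixed-length key directory up front."""
    items = sorted((k.encode().ljust(KEY_LEN, b"\0"), v) for k, v in pairs.items())
    directory, blob = [], b""
    for key, value in items:
        if len(key) > KEY_LEN:
            raise ValueError("key too long for the fixed-length directory")
        directory.append(ENTRY.pack(key, len(blob), len(value)))
        blob += value
    return COUNT.pack(len(items)) + b"".join(directory) + blob

def lookup(field, key):
    """Binary-search the key directory, then slice out only the matching value."""
    target = key.encode().ljust(KEY_LEN, b"\0")
    (count,) = COUNT.unpack_from(field, 0)
    values_start = COUNT.size + count * ENTRY.size
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        k, offset, length = ENTRY.unpack_from(field, COUNT.size + mid * ENTRY.size)
        if k == target:
            start = values_start + offset
            return field[start:start + length]
        if k < target:
            lo = mid + 1
        else:
            hi = mid
    return None

field = pack_field({"level": b"7", "device": b"ios", "score": b"1200"})
device = lookup(field, "device")
```

Because every directory entry has the same size, the position of the middle entry can be computed directly, so a lookup never has to scan the values of the other keys.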

And yes — Hadapt originally looked into using PostgreSQL’s JSON and key-value datatypes, but determined they were much too slow for these purposes.

This is new technology, so of course there are various rough edges.

While the key-value pairs could come from JSON, XML or other sources, Hadapt doesn’t currently offer much support for JSON- or XML-based data manipulation languages, basic ingest perhaps excepted. Rather, the data is just sets of key-value pairs, which may or may not have interesting naming conventions for the keys.
The PostgreSQL optimizer and query planner are in the dark; they just assume all columns are physical. Ditto the global Hadapt query engine.
Transitioning columns from virtual to physical status is still a very manual process.
If you try to do something on a multi-datatype column that could get you into trouble, there isn’t much protection in place against the resulting misfortunes.

Whether the missing features are soon added will depend in part on whether Hadapt commits to a strategy such as “analytic DBMS for multi-structured data”. I think they will, but it’s too early to be certain.

What about short-request schema-on-need?

As I’ve described it, schema-on-need seems focused on analytic/query-mainly use cases. Still, the “real-time” analytics boom suggests it may be relevant on the short-request side too. Stay tuned as I try to figure out just what is and isn’t happening in the relational+JSON world, for example in products such as MemSQL or DB2.

Related links

Hadapt’s architecture (June, 2013, with links to earlier posts)
DBMS with multiple DMLs (September, 2013)
The relational/non-tabular tradeoff (January, 2013)
Dynamic schemas (July, 2011)

Dave DeWitt responds to Daniel Abadi

Curt Monash — Thu, 06 Jun 2013 04:02:48 +0000

A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.

A key point that Daniel seems to be making is that parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines. I totally agree. In fact, several of us have been making the same point for years now (starting with the blog posts that Mike Stonebraker and I wrote more than 5 years ago). Time and time again relational database systems have been shown to be significantly faster. Last year we published a paper comparing PDW (w/o PolyBase) to Hive on two identical clusters (http://vldb.org/pvldb/vol5/p1712_avriliafloratou_vldb2012.pdf). We found that PDW was 3-10 times faster than Hive when executing the TPC-H benchmark.

Cloudera (Impala) and Pivotal (Hawq) seem to have come around to this same viewpoint. While both systems use HDFS for storage, neither uses MapReduce for executing query plans of relational operators. The Impala query engine was written from scratch in C++, and Hawq (apparently) uses a version of the Greenplum query engine adapted to read data directly from HDFS.

Hadapt, like Impala and Hawq, assumes (in general) that there is a single cluster in play and that they (as the DBMS vendor) can elect to use the resources (CPU, memory, and disk) on each node of the cluster in the way they think is best. For example, all three use the CPU and memory resources of the cluster to run a relational DB engine on each node. While Impala and Hawq leave the data in HDFS, Hadapt has concluded that they can get better performance by loading the data into PostgreSQL tables before executing most queries. In the case of Hadapt, one conceptually could think of PostgreSQL instances on each Hadoop “datanode” as a special type of “file format” that has the capability of not only storing local data, but also performing some query execution on that local chunk of data. Otherwise, the overall global execution is based on the traditional MapReduce engine accessing the data either from HDFS or from PostgreSQL. It is interesting that all three systems have, to varying degrees, concluded that many of their customers are willing to sacrifice the fault tolerance and ultimate scalability that MapReduce provides for performance.

Unlike Hadapt, Impala, and Hawq, in designing and building PolyBase our goal was not to build a general purpose scalable data warehousing solution. For that we already have SQL Server PDW. Rather, our goal was to extend the capabilities of PDW by allowing customers to use inexpensive Hadoop clusters while preserving the same T-SQL interface (which is used by a large number of third party applications and BI tools) to easily query and combine data regardless of where it lives and what format it is in. We expect that customers will primarily use the Hadoop cluster for their “cold” data or as their “digital shoebox”.

In addition, unlike Hadapt, Hawq, and Impala, PDW with PolyBase does not make a single-cluster assumption. Users may have two or more clusters, or two or more regions of the same cluster dedicated to different types of data. A deliberate design decision that we made at the beginning of the project was not to assume any control over a customer’s Hadoop cluster. Rather, PolyBase only assumes that it (1) can read and write HDFS files and (2) can submit MapReduce jobs to the cluster for execution. PolyBase is agnostic about what operating system the nodes of the Hadoop cluster are running, whose Hadoop distribution the cluster is running (Hortonworks, Cloudera, etc.), or whether the cluster is on-premise or in the cloud. We deliberately adopted this approach as we felt that it gave our customers the degree of flexibility that they need to be successful for a wide range of applications. In PolyBase, clusters are treated as first-class citizens by the system; the decision as to where to process parts of a query is determined by the system’s parallel query optimizer, based on the location of the data and capabilities of the cluster (e.g., # of nodes, CPU, memory of the cluster, load, etc.). In some situations, PDW data may be pushed into a Hadoop cluster to do the processing there instead. Furthermore, PolyBase can also be used to seamlessly query Hadoop-only data from two or more distinct clusters (e.g., one on-premise and another in the cloud) without combining with any RDBMS data.

One database to rule them all?

Curt Monash — Thu, 21 Feb 2013 05:52:05 +0000

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:

Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
Tokenization can help with the fixed-/variable-length tradeoff.
Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.
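The fixed-length tradeoff in the list above is easy to see in a small sketch. The record layout (a 4-byte id, a name padded to 16 bytes, an 8-byte balance) is a made-up example, not taken from any particular DBMS.

```python
import struct

# Hypothetical layout: uint32 id, name padded to 16 bytes, float64 balance.
RECORD = struct.Struct("<I16sd")

def write_records(rows):
    # Names shorter than 16 bytes are padded with zero bytes (wasted space);
    # names longer than 16 bytes would be silently truncated by struct.
    return b"".join(RECORD.pack(rec_id, name.encode(), balance)
                    for rec_id, name, balance in rows)

def read_record(buf, n):
    # The address is computed, not looked up: record n starts at n * RECORD.size.
    rec_id, name, balance = RECORD.unpack_from(buf, n * RECORD.size)
    return rec_id, name.rstrip(b"\0").decode(), balance

buf = write_records([(1, "Al", 10.5), (2, "Barbara", 0.0)])
second = read_record(buf, 1)
```

Both records occupy 28 bytes even though "Al" needs only two bytes for its name, which is the wasted space being traded for address arithmetic.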

What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.

So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears — for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:

If there’s much commonality among the component DBMS, each one would be sub-optimal.
If there’s little commonality among them, then there’s also little benefit to the combination.

Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.

Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:

Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
- Almost all the extensions come from the DBMS vendors themselves.
- Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
- Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
- While Microsoft never went to the trouble of offering full extensibility, otherwise the SQL Server story is similar.
- Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
Analytic platforms can wind up with all sorts of temporary data structures.
Analytic DBMS have various ways to reach out and touch Hadoop.

Further:

Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
Hadoop is growing into what is in effect a family of DBMS and other data stores: generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.
Hadapt combines Hadoop and PostgreSQL.
DataStax combines Cassandra, Hadoop, and Solr.
Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.

Related links

A taxonomy of database use cases (July, 2012)
An early form of this discussion in the single domain of analytic RDBMS (February, 2009)