Theory and architecture – DBMS 2 : DataBase Management System Services

Brittleness and incremental improvement

Curt Monash — Wed, 20 Jun 2018 09:13:56 +0000

Every system — computer or otherwise — needs to deal with possibilities of damage or error. If it does this well, it may be regarded as “robust”, “mature(d)”, “strengthened”, or simply “improved”.* Otherwise, it can reasonably be called “brittle”.

*It’s also common to use the word “harden(ed)”. But I think that’s a poor choice, as brittle things are often also hard.

0. As a general rule in IT:

New technologies and products are brittle.
They are strengthened incrementally over time.

There are many categories of IT strengthening. Two of the broadest are:

Bug-fixing.
Bottleneck Whack-A-Mole.

1. One of my more popular posts stated:

Developing a good DBMS requires 5-7 years and tens of millions of dollars.

The reasons I gave all spoke to brittleness/strengthening, most obviously in:

Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

Similar things are true for other kinds of “platform software” or distributed systems.

2. The UI brittleness/improvement story starts similarly:

Graphical user interfaces can present users’ choices clearly, making them great antidotes to users’ initial lack of system knowledge or training.
Usability testing and engineering can lead to improvements and the removal of glitches.

Unfortunately, however, as systems add or change features, UI navigation can get more difficult over time rather than easier.

In at least one scenario — plane crashes due to confused-pilot error — the consequences can be literally fatal.

3. Sometimes brittleness just doesn’t get solved.

Security is perhaps the most visible example. Almost every security system can be broken, and bad actors actively do so.
Another example was 1980s-90s CASE (Computer-Aided Software Engineering), specifically in the area of generating code from specifications. The technology was only able to generate apps that performed a limited set of functions — too limited to be useful outside of certain niches — and never successfully evolved further.

4. Large organizations are riddled with screw-ups. One of the most successful large enterprises in world history was the US military of World War 2 — and that is literally the organization where the word snafu was coined.

The response is often bureaucracy. Somebody makes a mistake; procedures and rules are then instituted to ensure the mistake is never repeated. Over time, many rules and procedures build up, until organizational systems are hardened. Business processes wind up taking many steps, each of which represents both a cost and a potential for failure. And sometimes the only decisions that successfully get through the process are uninspired, uncreative or flat-out wrong.

This is the classic example of “hardening” — commonly expressed via its rough synonym “calcification” — adding even more brittleness than it removes.

Outward-facing/regulatory bureaucracies can be even worse.

Regulators generally have two constituencies — consumers/general public and businesses — with the benefits of regulation going to one and the costs going to the other. Anything regulators do will likely displease at least one constituency.
These ever-unsatisfactory regulations can often only be changed through a long administrative or legislative process.
Violating regulations has unpredictable but sometimes severe consequences.

The whole thing is a colossal mess.

5. Many of the previous points apply to enterprise applications, which facilitate business processes, have UIs, and commonly involve platform-like technology as well.

Changing enterprise apps can take both discontinuous and incremental forms; you can either replace your old apps outright or change something in (or in the implementation of) the ones you have.

If you rip-and-replace your apps, you’re likely to also do so to your business processes, and vice-versa. Discontinuous business process change is often seen as a great virtue, sometimes under the buzzname business process reengineering (BPR).
If you want to change your processes more incrementally, you likely need one or both of two things:
- App software with more features than you initially need. That may be easy to get, but it isn’t cheap.
- A nimble IT department. That one is neither cheap nor easy.

6. My biggest reason for writing about brittleness and improvement is to approach some topics around analytics and AI. As previously noted:

Artificial intelligence is facing public skepticism both for being too accurate (!) and not accurate enough.
Analytics in general is often surprisingly inaccurate.

Please stay tuned.

Related link(s)

Most of what I wrote in my December, 2015 series about artificial intelligence still holds true.

Technology implications of political trends

Curt Monash — Sun, 20 May 2018 19:29:09 +0000

The tech industry has a broad range of political concerns. While I may complain that things have been a bit predictable in other respects, politics is having real and new(ish) technical consequences. In some cases, existing technology is clearly adequate to meet regulators’ and customers’ demands. Other needs look more like open research challenges.

1. Privacy regulations will be very different in different countries or regions. For starters:

This is one case in which the European Union’s bureaucracy is working pretty well. It’s making rules for the whole region, and they aren’t totally crazy ones.
Things are more chaotic in the English-speaking democracies.
Authoritarian regimes are enacting anti-privacy rules.

All of these rules are subject to change based on:

Genuine technological change.
Changes in politicians’ or the public’s perceptions.

And so I believe: For any multinational organization that handles customer data, privacy/security requirements are likely to change constantly. Technology decisions need to reflect that reality.

2. Data sovereignty/geo-compliance is a big deal. In fact, this is one area where the EU and authoritarian countries such as Russia formally agree. Each wants its citizens’ data to be stored locally, so as to ensure adherence to local privacy rules.

For raw, granular data, that’s a straightforward — even if annoying — requirement to meet. But things get murkier for data that is aggregated or otherwise derived.

3. Data anonymization needs to be credibly and reliably solved. Reliable data anonymization, in principle, could moot multiple conflicts, including:

The tension between data sovereignty and global analytics.
The tension between medical research and health data privacy.

But doing so will be extremely hard.

Consumer internet deanonymization is brutally effective.
- Birth date/zip code combinations come close to identifying people uniquely.
- So does the computer/browser configuration information that gets communicated to any website.
Ordinary citizens are only beginning to realize how not-anonymous they really are.
Mistrust of data-rich technology companies has recently exploded.
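To make the birth date/zip code point concrete, here is a minimal sketch of how one might measure it on a dataset. The records are invented and the function names are hypothetical. The "smallest group" figure is what the privacy literature calls k in k-anonymity.

```python
from collections import Counter

# Hypothetical toy records: (birth_date, zip_code). The values are made up
# purely for illustration; real data would come from a customer table.
records = [
    ("1980-03-14", "02139"),
    ("1980-03-14", "02139"),
    ("1975-11-02", "02139"),
    ("1991-07-30", "94110"),
    ("1991-07-30", "94103"),
    ("1968-01-09", "60614"),
]

def unique_fraction(rows):
    """Fraction of rows whose quasi-identifier combination occurs exactly once."""
    counts = Counter(rows)
    singletons = sum(1 for row in rows if counts[row] == 1)
    return singletons / len(rows)

def smallest_group(rows):
    """Size of the smallest group of rows sharing one combination (the k in k-anonymity)."""
    return min(Counter(rows).values())

print(unique_fraction(records))  # share of people pinned down by the two fields alone
print(smallest_group(records))   # 1 means at least one person is uniquely identifiable
```

In this toy data, four of the six records are unique on those two fields, which is the pattern that makes "anonymized" releases risky.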

4. Transparency will be demanded, in multiple forms. For starters:

The complexity and sophistication of data privacy regulation will naturally lead to demands that compliance be straightforwardly auditable.
Those demands will come from regulators and consumers alike.

We more or less know how to do that part already.

But where things get really messy is in the area of “black box” algorithms. People are concerned about potentially-untrustworthy automated decisions in many areas, such as:

Self-driving cars (which occasionally kill people).
Junk-filled news feeds (which make democracy more difficult).
Various “creepy” e-commerce behaviors.

In principle, I stand by my opinion from 2012: When consumers lose trust in algorithmic decision makers, it can be at least partially regained by a shift to translucent modeling. But the mathematics of accomplishing that seem — as it were — rather unclear.

Related link

Already in 2011, deanonymization and other privacy-threatening technologies were very effective.

Some stuff that’s always on my mind

Curt Monash — Sun, 20 May 2018 19:27:21 +0000

I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans.

So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I write in the near future, and indeed a few hooks for such posts are included below.

1. Data(base) management technology is progressing pretty much as I expected.

Vendors generally recognize that maturing a data store is an important, many-years-long process.
Multiple kinds of data model are viable …
… but it’s usually helpful to be able to do some kind of JOIN.
To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)

2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.

My two central examples have long been inaccurate metrics and false-positive alerts.
In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table with your imperfect accuracy.
Enterprise search and other text technologies are still often terrible.
After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.

3. Outside traditional enterprises, the accuracy problem can be even worse, and the consequences of analytic inaccuracy can be severe. In some cases this is well understood; autonomous vehicle researchers, for example, seem properly attentive to the challenge of not-killing-pedestrians. But in others it’s a mess. For example, I don’t think the “fake news on social media” challenge will be resolved without new technical approaches that, to my knowledge, aren’t yet even being tried.

4. More generally, I’ve long argued that the technology industry would someday have to deal with a variety of public policy and social concerns. That day has come. In anticipation, I wrote at length about privacy/surveillance, and a little about some other areas, including net neutrality, patents, economic development, and public technology spending. Missing subjects include censorship (private and public alike), and perhaps also the efforts to tie data ownership into anti-trust policy.

5. Given all the tech-specific public policy work that’s needed, I’m pulling back from some of my broader political efforts. However, I stand by my overview opinions of last February, and I delivered on some of its IOUs in a two-part series on persuasion.

6. The ongoing rise of “edge computing” and the “Internet of Things” fit into the general trend that in 2013 I summarized as appliances, clusters and clouds.

7. I continue to think that a huge fraction of analytics is properly characterized as monitoring. That ties into a number of areas of interest. For example:

Platform technologies — including distributed data management — often compete on the maturity of their built-in monitoring.
My complaints about BI inaccuracy commonly relate to use cases in monitoring.
Privacy/surveillance issues are commonly about monitoring. It’s common to worry that such monitoring is actually too accurate.
But I also worry that privacy/surveillance monitoring isn’t accurate enough … and hence that it leads to people being discriminated against who absolutely don’t need to be.
Edge computing involves a lot of devices that need to be monitored.
Censorship obviously has a lot to do with monitoring.

8. And finally for now, my core precepts for strategic messaging haven’t changed.

Related link

As you may have already guessed, the title of this post is based on a classic song.

Imanis Data

Curt Monash — Tue, 22 Aug 2017 12:46:20 +0000

I talked recently with the folks at Imanis Data. For starters:

The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
“Imanis” is a new name; the previous name was “Talena”.

Also:

Imanis has ~35 subscription customers, a significant majority of which are in the Fortune 1000.
Customer industries, in roughly declining order, include:
- Financial services other than insurance.
- Insurance.
- Retail.
- “Technology”.
~40% of Imanis customers are in the public cloud.
Imanis is focused on the North American market at this time.
Imanis has ~45 employees.
The Imanis product just hit Version 3.

Imanis correctly observes that there are multiple reasons you might want to recover from backup, including:

General disaster/system failure.
Bug in an application that writes data.
Malicious acts, including encryption-by-ransomware.

Imanis uses the phrase “point-in-time backup” to emphasize its flexibility in letting you choose your favorite time-version of your rolling backup.

Imanis also correctly draws the inference that the right backup strategy is some version of:

Make backups very frequently. This boils down to “Do a great job of making incremental backups (and restoring from them when necessary).” This is where Imanis has spent the bulk of its technical effort to date.
In case recovery is needed, identify the last clean (or provably/confidently clean) version of the database and restore from that. The identification part boils down to letting the backup databases be queried directly. That’s largely a roadmap item.
- Imanis has recently added the capability to build its own functionality querying the backup database.
- JDBC/whatever general access is still in the future.

Note: When Imanis backups offer direct query access, the possibility will of course exist to use the backup data for general query processing. But while that kind of capability sounds great in theory, I’m not aware of it being a big deal (on technology stacks that already offer it) in practice.

The most technically notable other use cases Imanis mentioned are probably:

Data science dataset generation. Imanis lets you generate a partial copy of the database for analytic or test purposes.
- You can project, select or sample your data, which suggests use of the current query capabilities.
- There’s an API to let you mask Personally Identifiable Information by writing your own data transformations.
Archiving/tiering/ILM (Information Lifecycle Management). Imanis lets you divide data according to its hotness.
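As a rough illustration of the dataset-generation idea above (select, project, sample, then mask PII with a user-written transformation), here is a minimal sketch. It is not Imanis' API, which I haven't seen. `partial_copy`, `mask_email` and the salt value are all hypothetical names chosen for the example.

```python
import hashlib
import random

def mask_email(value, salt="demo-salt"):
    """User-written transformation: replace an email with a salted one-way digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user-" + digest[:12]

def identity(value):
    return value

def partial_copy(rows, columns, predicate, sample_rate, transforms, seed=0):
    """Select rows, sample them, project columns, and apply per-column masking."""
    rng = random.Random(seed)  # seeded so the sample is repeatable
    out = []
    for row in rows:
        if not predicate(row):           # select
            continue
        if rng.random() >= sample_rate:  # sample
            continue
        # project, masking any column that has a registered transformation
        out.append({c: transforms.get(c, identity)(row[c]) for c in columns})
    return out

rows = [
    {"id": 1, "email": "a@example.com", "region": "EU", "balance": 10},
    {"id": 2, "email": "b@example.com", "region": "US", "balance": 20},
    {"id": 3, "email": "c@example.com", "region": "EU", "balance": 30},
]
test_copy = partial_copy(
    rows,
    columns=["id", "email"],
    predicate=lambda r: r["region"] == "EU",
    sample_rate=1.0,
    transforms={"email": mask_email},
)
print(test_copy)
```

A truncated salted hash keeps joins on the masked column working, but it is a weak form of anonymization. Treat it as a placeholder for whatever transformation your compliance rules actually require.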

Imanis views its competition as:

Native utilities of the data stores.
Hand-coded scripts.
Datos.io, principally in the Cassandra market (so far).

Beyond those, the obvious comparison to Imanis is Delphix. I haven’t spoken with Delphix for a few years, but I believe that key differences between Delphix and Imanis start:

Delphix is focused on widely-installed RDBMS such as Oracle.
Delphix actually tries to have different production logical copies of your database run off of the same physical copy. Imanis, in contrast, offers technology to help you copy your databases quickly and effectively, but the copies you actually use will indeed be separate from each other.

Imanis software runs on its own cluster, based on hacked Hadoop. A lot of the hacking seems to relate to a metadata store, which supports things like:

Understanding which (incrementally backed up) blocks need to be pulled together to make a specific copy of the database.
Putting data in different places for ILM/tiering.
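The first of those metadata tasks can be sketched in a few lines. This is my own toy model, not Imanis' design: a hypothetical `BackupCatalog` remembers when each block version was backed up, and a point-in-time restore picks, for every block, the latest version at or before the requested time.

```python
from bisect import bisect_right

class BackupCatalog:
    """Hypothetical metadata store for incrementally backed-up blocks."""

    def __init__(self):
        # block_id -> (sorted backup timestamps, payloads in the same order)
        self._versions = {}

    def record(self, block_id, timestamp, payload):
        """Register one incremental backup of a block.

        Assumes backups for a given block arrive in timestamp order.
        """
        times, payloads = self._versions.setdefault(block_id, ([], []))
        times.append(timestamp)
        payloads.append(payload)

    def restore(self, as_of):
        """Assemble the database copy as it stood at time `as_of`."""
        snapshot = {}
        for block_id, (times, payloads) in self._versions.items():
            idx = bisect_right(times, as_of)  # number of versions at or before as_of
            if idx:
                snapshot[block_id] = payloads[idx - 1]
        return snapshot

catalog = BackupCatalog()
catalog.record("b1", 1, "A1")   # the initial full backup
catalog.record("b2", 1, "B1")
catalog.record("b1", 2, "A2")   # later increments only carry changed blocks
catalog.record("b3", 3, "C3")
print(catalog.restore(as_of=2))
```

The sketch ignores deleted blocks and tiering. A real catalog would also need to record where each version physically lives.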

Another piece of Imanis tech is machine-learning-based anomaly detection.

As incrementally backed-up blocks arrive, Imanis flags anomalous ones.
Each flag is given a reason.
You can denounce the flag as a false alert, and hopefully similar flags won’t be raised in the future.

The technology for this seems rather basic:

Random forests for the flagging.
No drilldown w/in the Imanis system for follow-up.

But in general concept this is something a lot more systems should be doing.

Most of the rest of Imanis’ tech story is straightforward — support various alternatives for computing platforms, offer the usual security choices, etc. One exception that was new to me was the use of erasure codes, which seem to be a generalization of the concept of parity bits. Allegedly, when used in a storage context these have the near-magical property of offering 4X replication safety with only a 1.5X expansion of data volume. I won’t claim to have understood the subject well enough to see how that could make sense, or what tradeoffs it would entail.
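For what it's worth, the arithmetic behind that claim can be checked under one assumption: that the erasure code is something like Reed-Solomon with 6 data blocks and 3 parity blocks, a layout used in HDFS erasure coding. I don't know which parameters Imanis actually uses, so the (6, 3) choice below is illustrative only.

```python
def replication(copies):
    """Storage expansion and survivable losses for n-way replication."""
    return {"expansion": float(copies), "tolerated_losses": copies - 1}

def erasure_code(data_blocks, parity_blocks):
    """Same figures for a Reed-Solomon-style code with k data and m parity blocks.

    Any m of the k + m blocks can be lost and the data can still be rebuilt.
    """
    return {
        "expansion": (data_blocks + parity_blocks) / data_blocks,
        "tolerated_losses": parity_blocks,
    }

four_way = replication(4)          # 4 copies: survives losing any 3 of them
rs_6_3 = erasure_code(6, 3)        # assumed layout: 6 data + 3 parity blocks
print(four_way)
print(rs_6_3)
```

Both schemes survive three simultaneous losses, but the coded one stores 1.5X the data rather than 4X. The usual tradeoff is that rebuilding a lost block means reading several surviving blocks and doing some computation, whereas replication just copies one.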

Analytics on the edge?

Curt Monash — Fri, 30 Jun 2017 08:27:18 +0000

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration.

There are any number of situations in which decisions are made on or about remote systems, based on models or rules that should be improved over time. For example, such decisions might be made in:

Machine vision or other “recognition”-oriented areas of AI.
Detection or prediction of malfunctions.
Choices as to what data is significant enough to ship back upstream.

In the canonical case, we might envision a system in which:

Huge amounts of data are collected and are used to make real-time decisions.
The models are trained centrally, and updated remotely over time as they are improved.
The remote systems can only ship back selected or aggregated data to help train the models.

This all seems like an awkward fit for any common computing architecture I can think of.

But it’s hard to pin down important examples of that “canonical” case. The story implicitly assumes:

A model is widely deployed.
The model does a decent job but not a perfect one.
Based on its successes and failures, the model gets improved.

And now we’re begging a huge question: What exactly is there that keeps score as to when the model succeeds and fails? Mathematically speaking, I can’t imagine what a general answer would be like.

4. So when it comes to predictive models executed on real-world appliances I think that analytic workflows will:

Differ for different (categories of) applications.
Rely in most cases on simple patterns of data movement, such as:
- Stream everything to central servers and sort it out there, or if that’s not workable …
- … instrument a limited number of test nodes to store everything, and recover the data in batch for analysis.
- Update models only in timeframes that you’re doing a full app update/refresh.

And with that much of the apparent need for fancy distributed analytic architectures evaporates.

5. Finally, and notwithstanding the previous point: Across many use cases, there’s some kind of remote log data being shipped back to a central location. It may be the complete log. It may be periodic aggregates. It may happen only for what the edge nodes regard as significant events. But something is getting shipped home.

The architectures for shipping, receiving and analyzing such data are in many cases immature. That’s obvious if there’s any kind of streaming involved, or if analysis is done in Spark. Ditto if there’s anything we might call “non-tabular business intelligence”. As this stuff matures, it will in many cases fit very well with today’s cloud thinking. But in any case — it needs to mature.

Truth be told, even the relational case is immature, in that it can easily rely on what I called:

data warehouses (perhaps really data marts) that are updated in human real-time

That quote is from a recent post about Kudu, which:

Is designed for exactly that use case.
Went GA early this year.

As always, technology is in flux.

Related links

Interana is another example of very new technology that seems applicable to these use cases.
My 2013 post on the future of IT architectures still rings true.

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.
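The two configurable behaviors can be illustrated with a toy in-memory table. To be clear, this is not the Kudu client API. `KeyedTable`, `DuplicateKeyError` and `ingest` are hypothetical names, and the point is only to show why either policy makes retried deliveries harmless.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert would violate primary-key uniqueness."""

class KeyedTable:
    """Toy table that enforces unique primary keys."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        # "Reject duplicates" policy: a second write of the same key fails.
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = row

    def upsert(self, key, row):
        # "Upsert" policy: a second write of the same key overwrites the first.
        self.rows[key] = row

def ingest(table, events, mode):
    """Apply a stream of (key, row) events; return how many were rejected."""
    rejected = 0
    for key, row in events:
        if mode == "upsert":
            table.upsert(key, row)
        else:
            try:
                table.insert(key, row)
            except DuplicateKeyError:
                rejected += 1  # a retry delivered this row a second time
    return rejected

# The event with key 1 is delivered twice, as a retry would do.
stream = [(1, {"v": "a"}), (2, {"v": "b"}), (1, {"v": "a"})]

upserted = KeyedTable()
rejected_by_upsert = ingest(upserted, stream, mode="upsert")

strict = KeyedTable()
rejected_by_insert = ingest(strict, stream, mode="insert")
print(len(upserted.rows), rejected_by_upsert, len(strict.rows), rejected_by_insert)
```

Either way the table ends up with one row per key. The difference is whether the duplicate silently overwrites or surfaces as an error your code can count and handle.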

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, including:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 figures.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- “Exactly which steps in the onboarding process result in the greatest user frustration?”
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).
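I haven't seen Interana's actual DML, so the following is only a sketch of the state-machine idea in plain Python, applied to the shopping-cart question above. The event layout, the function name, and the choice to measure time-to-checkout from the first add-to-cart are all my assumptions.

```python
from collections import Counter

# Hypothetical event log: (session_id, minute_offset, event, product).
events = [
    ("s1", 0, "add_to_cart", "lamp"),
    ("s1", 5, "add_to_cart", "rug"),
    ("s1", 45, "checkout", None),
    ("s2", 0, "add_to_cart", "lamp"),
    ("s2", 10, "checkout", None),
    ("s3", 0, "add_to_cart", "lamp"),
    ("s3", 50, "checkout", None),
]

def slow_checkout_products(events, threshold_minutes=30):
    """Count products in carts whose time-to-checkout exceeded the threshold.

    Each session is a tiny state machine: "no cart" -> "cart open" on the
    first add_to_cart, then back to "no cart" on checkout.
    """
    open_carts = {}  # session_id -> (time of first add, products so far)
    counts = Counter()
    ordered = sorted(events, key=lambda e: (e[0], e[1]))  # by session, then time
    for session, minute, event, product in ordered:
        if event == "add_to_cart":
            _, items = open_carts.setdefault(session, (minute, []))
            items.append(product)
        elif event == "checkout" and session in open_carts:
            started, items = open_carts.pop(session)
            if minute - started > threshold_minutes:
                counts.update(items)
    return counts.most_common()

print(slow_checkout_products(events))
```

Expressing even this simple question in SQL takes self-joins or window functions, which is presumably the fluency gap a purpose-built event-series language is meant to close.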

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies — i.e., a few minutes. (Obviously, this shuts them out of most network monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.
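The trade of exactness for bounded response time mentioned above can be sketched generically. This is not Interana's algorithm, whose details I don't know. `approximate_count` is a hypothetical helper that counts matches in a uniform random sample and scales the result up to the full row count.

```python
import random

def approximate_count(rows, predicate, sample_size, seed=0):
    """Estimate how many rows satisfy `predicate` from a uniform random sample."""
    if sample_size >= len(rows):
        # Small enough to scan in full, so return the exact answer.
        return sum(1 for row in rows if predicate(row))
    rng = random.Random(seed)  # seeded only to make the example repeatable
    sample = rng.sample(rows, sample_size)
    hits = sum(1 for row in sample if predicate(row))
    # Scale the sample's hit count up to the size of the full data set.
    return round(hits * len(rows) / sample_size)

rows = list(range(100_000))
is_match = lambda value: value % 4 == 0

exact = sum(1 for value in rows if is_match(value))
estimate = approximate_count(rows, is_match, sample_size=10_000)
print(exact, estimate)
```

The work done is proportional to the sample size rather than the table size, which is what keeps response time to a low number of seconds. The price is an error that shrinks only with the square root of the sample size.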

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be lists themselves.
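Here is a minimal sketch of what that flattening might look like. The dotted separator and the function name are my assumptions about "the usual way", not something Interana specified.

```python
def flatten(record, prefix="", sep="."):
    """Turn a nested JSON object into a flat {column_name: value} mapping.

    Nested objects become multi-part column names. Lists are kept as column
    values, but a list that contains another list is rejected.
    """
    columns = {}
    for key, value in record.items():
        name = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            columns.update(flatten(value, name, sep))
        elif isinstance(value, list):
            if any(isinstance(item, list) for item in value):
                raise ValueError(f"column {name!r}: lists of lists are not supported")
            columns[name] = list(value)
        else:
            columns[name] = value
    return columns

event = {
    "user": {"id": 7, "geo": {"country": "CA"}},
    "tags": ["new", "mobile"],
}
flat = flatten(event)
print(flat)
```

A record that gains a new field later simply yields a new, mostly empty column, which fits the tolerance for gradual schema change described above.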

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.
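To show why sorting helps compression, and what a global dictionary buys you, here is a toy sketch of dictionary encoding followed by run-length encoding. The function names are hypothetical, and real columnar stores pack codes and runs far more tightly than Python lists do.

```python
def dictionary_encode(values, dictionary):
    """Map values to small integer codes, extending the shared dictionary as needed."""
    codes = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return codes

def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]

global_dictionary = {}  # one dictionary shared by every shard
column = ["US", "CA", "US", "US", "CA", "US"]

unsorted_runs = run_length_encode(dictionary_encode(column, global_dictionary))
sorted_runs = run_length_encode(dictionary_encode(sorted(column), global_dictionary))

# A second shard reuses the same codes, so "CA" means the same thing everywhere.
other_shard_codes = dictionary_encode(["CA", "MX"], global_dictionary)
print(len(unsorted_runs), len(sorted_runs), other_shard_codes)
```

The unsorted column needs five runs and the sorted one needs two. With low-cardinality data the shared dictionary stays small, which is the condition under which a global dictionary is the better choice.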

Introduction to SequoiaDB and SequoiaCM

Curt Monash — Sun, 12 Mar 2017 18:19:49 +0000

For starters, let me say:

SequoiaDB, the company, is my client.
SequoiaDB, the product, is the main product of SequoiaDB, the company.
SequoiaDB, the company, has another product line, SequoiaCM, which subsumes SequoiaDB in content management use cases.
SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
… and is usually sold for RDBMS-like use cases …
… except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
SequoiaDB’s products are open source.
SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.

Also:

SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations.
SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)

Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.

While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.

*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.

SequoiaDB’s technology story starts:

SequoiaDB is a layered DBMS.
It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
Indexes are B-tree.
Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
Relational batch processing is done via SparkSQL.
There also is a block/LOB (Large OBject) storage engine meant for content management applications.
SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.
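
The logical/physical partition scheme mentioned above can be sketched in a few lines of Python. This is a generic illustration, not SequoiaDB's code: the CRC32 hash, the round-robin initial assignment, the greedy rebalancing rule, and the node names are all assumptions. Only the count of 4096 logical partitions comes from the notes above.

```python
import zlib
from typing import Dict, List

LOGICAL_PARTITIONS = 4096


def logical_partition(shard_key: str) -> int:
    """Map a key to a logical partition; this never changes as nodes come and go."""
    return zlib.crc32(shard_key.encode("utf-8")) % LOGICAL_PARTITIONS


def assign(nodes: List[str]) -> Dict[int, str]:
    """Initial round-robin assignment of logical partitions to physical nodes."""
    return {p: nodes[p % len(nodes)] for p in range(LOGICAL_PARTITIONS)}


def rebalance(assignment: Dict[int, str], nodes: List[str]) -> Dict[int, str]:
    """Reassign logical partitions to a new node set, moving only what must move."""
    quota = -(-LOGICAL_PARTITIONS // len(nodes))  # ceiling division
    load = {node: 0 for node in nodes}
    result: Dict[int, str] = {}
    homeless: List[int] = []
    for partition, node in sorted(assignment.items()):
        if node in load and load[node] < quota:
            result[partition] = node  # stays where it is
            load[node] += 1
        else:
            homeless.append(partition)  # old node is gone or over quota
    for partition in homeless:
        target = min(nodes, key=lambda n: load[n])  # least-loaded node
        result[partition] = target
        load[target] += 1
    return result


before = assign(["node1", "node2", "node3", "node4"])
after = rebalance(before, ["node1", "node2", "node3", "node4", "node5"])
moved = [p for p in before if before[p] != after[p]]
```

The point of the indirection is that adding a node moves whole logical partitions, while the mapping from key to logical partition stays fixed, so no individual record has to be rehashed.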

SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:

SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.

PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.

I neglected to ask how much of that remains true when SparkSQL is invoked.

SequoiaDB’s use cases to date seem to fall mainly into three groups:

Content management via SequoiaCM.
“Operational data lakes”.
Pretty generic replacement of legacy RDBMS.

Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.

To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:

2-3 years of data, and not all the data even from that time period.
Only enough processing power to support structured business intelligence …
… and hence little opportunity for ad-hoc query.

SequoiaDB operational data lakes offer multiple improvements over that scenario:

They hold as much relational data as customers choose to dump there.
That data can be simply copied from operational stores, with no transformation.
Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
Queries can be run straight against this data soup.
Of course, views can also be set up in advance to help with querying.

Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change, well, a view can be established to allow for that.
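
The view-over-changing-schemas idea can be illustrated with any SQL engine. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it is not SequoiaDB, and the table names, column names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per schema version; v2 added a column.
    CREATE TABLE accounts_v1 (account_id INTEGER, balance REAL);
    CREATE TABLE accounts_v2 (account_id INTEGER, balance REAL, branch TEXT);
    INSERT INTO accounts_v1 VALUES (1, 100.0), (2, 250.0);
    INSERT INTO accounts_v2 VALUES (3, 75.0, 'north');

    -- The view exposes only the columns that did not change.
    CREATE VIEW accounts_stable AS
        SELECT account_id, balance FROM accounts_v1
        UNION ALL
        SELECT account_id, balance FROM accounts_v2;
    """
)

rows = conn.execute(
    "SELECT account_id, balance FROM accounts_stable ORDER BY account_id"
).fetchall()
total = conn.execute("SELECT SUM(balance) FROM accounts_stable").fetchone()[0]
```

Queries against `accounts_stable` see all three accounts regardless of which schema version each row arrived under.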

Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:

Photographs as part of an authentication process.
Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
Storage of security videos (for example from automated teller machines).

SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.

Introduction to Crate.io and CrateDB

Curt Monash — Sun, 18 Dec 2016 05:27:15 +0000

Crate.io and CrateDB basics include:

Crate.io makes CrateDB.
CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
Crate.io says it has 22 employees and 5 paying customers.
Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
… where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
… except when they’re just BLOBs (Binary Large OBjects).
There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

Writes and single-row look-ups.
Aggregates and joins.

And so it makes sense that:

Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.
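
A toy Python model may help show how the two consistency levels can coexist. The class and method names are hypothetical, the sketch is insert-only, and the refresh is triggered by hand rather than by a timer; it is a sketch of the general pattern, not of CrateDB internals.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple


class MicrobatchStore:
    """Toy store: rows are readable at once, the query index lags behind."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, Any]] = {}
        self.pending: List[str] = []  # keys written since the last refresh
        self.index: Dict[Tuple[str, Any], Set[str]] = defaultdict(set)

    def write(self, key: str, row: Dict[str, Any]) -> None:
        self.rows[key] = row      # visible to get() immediately
        self.pending.append(key)  # indexed later, in a batch

    def get(self, key: str) -> Dict[str, Any]:
        """Single-row lookup: reads the writer's own write right away."""
        return self.rows[key]

    def refresh(self) -> int:
        """Index everything written since the last refresh; return the batch size."""
        batch, self.pending = self.pending, []
        for key in batch:
            for column, value in self.rows[key].items():
                self.index[(column, value)].add(key)
        return len(batch)

    def find(self, column: str, value: Any) -> Set[str]:
        """Index-backed search: only sees rows from completed refreshes."""
        return set(self.index.get((column, value), set()))


store = MicrobatchStore()
store.write("r1", {"sensor": "temp", "reading": 21.5})
row_now = store.get("r1")                # available immediately
before = store.find("sensor", "temp")    # index has not caught up yet
store.refresh()                          # in a real system, a timer would do this
after = store.find("sensor", "temp")
```

The single-row read succeeds straight after the write, while the index-backed search only finds the row once the microbatch has run.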

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
  - More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
  - CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
  - Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.
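
For readers unfamiliar with inverted-list indexing, here is a minimal Python sketch of the idea: each column maps every distinct value to the set of rows containing it, so filters become set intersections and simple aggregates can be answered from the index alone. The class name, methods and sample rows are hypothetical, and this is the generic technique rather than CrateDB's or Lucene's actual data structure.

```python
from collections import defaultdict
from typing import Any, Dict, Set


class InvertedColumnIndex:
    """Per column, map each distinct value to the ids of rows containing it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Dict[Any, Set[int]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def add(self, row_id: int, row: Dict[str, Any]) -> None:
        for column, value in row.items():
            self.postings[column][value].add(row_id)

    def lookup(self, column: str, value: Any) -> Set[int]:
        """Row ids where column == value (empty set if none)."""
        return set(self.postings.get(column, {}).get(value, set()))

    def count_by(self, column: str) -> Dict[Any, int]:
        """GROUP BY column, COUNT(*) answered without touching the rows."""
        return {value: len(ids) for value, ids in self.postings.get(column, {}).items()}


index = InvertedColumnIndex()
index.add(1, {"sensor": "temp", "site": "berlin"})
index.add(2, {"sensor": "temp", "site": "vienna"})
index.add(3, {"sensor": "humidity", "site": "berlin"})

# WHERE sensor = 'temp' AND site = 'berlin' becomes a set intersection.
matches = index.lookup("sensor", "temp") & index.lookup("site", "berlin")
per_sensor = index.count_by("sensor")
```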

Crate.io is proud of its distributed/parallel story.

Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
The only join strategy currently implemented is nested loop. Others are planned for the future.
CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
I imagine CrateDB administrative tools are still rather primitive.
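
For reference, a nested loop join is the simplest join strategy there is: compare every row of one input with every row of the other. The Python sketch below shows the textbook algorithm with hypothetical table contents; it is not CrateDB's implementation.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def nested_loop_join(
    outer: Iterable[Row],
    inner: List[Row],
    predicate: Callable[[Row, Row], bool],
) -> Iterator[Row]:
    """Yield merged rows for every (outer, inner) pair that satisfies predicate."""
    for left in outer:
        for right in inner:  # inner input is rescanned for each outer row
            if predicate(left, right):
                yield {**left, **right}


# Hypothetical tables, for illustration only.
sensors = [{"sensor_id": 1, "kind": "temp"}, {"sensor_id": 2, "kind": "humidity"}]
readings = [
    {"sensor_id": 1, "value": 21.5},
    {"sensor_id": 2, "value": 0.4},
    {"sensor_id": 1, "value": 22.0},
]

joined = list(
    nested_loop_join(sensors, readings, lambda s, r: s["sensor_id"] == r["sensor_id"])
)
```

The cost is proportional to the product of the two input sizes, which is why hash and merge joins are usually preferred for large equi-joins.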

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Edit: For some clarification and even correction, please see the first comment below.

DBAs of the future

Curt Monash — Wed, 23 Nov 2016 12:02:31 +0000

After a July visit to DataStax, I wrote:

The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.

Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.

First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.

My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views:

Are logical rather than materialized. (At least at this time.)
Have their permissions and so on set by the DBA.
Are the sole thing the programmer writes against.
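
As a sketch of what such a view definition might look like, the Python function below builds a MongoDB `create` command document using the `viewOn` and `pipeline` fields. The helper name, collection names and field names are hypothetical, and the function only constructs the document; actually running it would require a live server and a driver call such as pymongo's `Database.command`.

```python
from typing import Any, Dict, List


def view_definition(view: str, source: str, fields: List[str]) -> Dict[str, Any]:
    """Build a MongoDB `create` command document for a logical view.

    The view exposes only the listed fields of the source collection.
    """
    projection: Dict[str, int] = {name: 1 for name in fields}
    projection.setdefault("_id", 0)  # hide _id unless explicitly requested
    return {
        "create": view,       # command name must be the first key
        "viewOn": source,     # underlying collection
        "pipeline": [{"$project": projection}],
    }


# Hypothetical collection and field names.
command = view_definition("customers_public", "customers", ["name", "region"])
```

A DBA would then grant programmers read access to `customers_public` only, so that application code never touches the underlying collection directly.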

Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.

*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself.

Two trends that I think could make DBAs’ lives even more interesting and challenging in the future are:

The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.

Bottom line: Database administration skills will be needed for a long time to come.