Facebook – DBMS 2 : DataBase Management System Services

The technology industry is under broad political attack

Curt Monash — Fri, 15 Dec 2017 09:25:55 +0000

I apologize for posting a December downer, but this needs to be said.

The technology industry is under attack:

From politicians and political pundits …
… especially from “populists” and/or the political right …
… in the United States and other countries.

These attacks:

Are in some cases specific to internet companies such as Google and Facebook.
In some cases threaten the tech industry more broadly.
Are in some cases part of general attacks on the educated/professional/“globalist”/“coastal” “elites”.

You’ve surely noticed some of these attacks. But you may not have noticed just how many different attacks and criticisms there are, on multiple levels.

1. Concerns about jobs, disruption, gentrification and so on are a Really Big Deal, causing large swaths of the population to regard technology as bad for their pocketbooks. In particular:

There’s tremendous concern about job loss to automation and/or globalization. Technology helps cause the first and enable the second.
Generally, when an industry destroys jobs, one hopes that it will create new ones to take their place. But while US technology companies have created many jobs, a lot of those are overseas.
Flaps about overseas finances, taxes, and so on aren’t helping. Apple, for example, has major issues in Europe and the US alike.
Working-class jobs that tech companies do create are often attacked for their pay and conditions, e.g. for Amazon warehouse workers or Uber drivers.
Even when the technology industry unquestionably creates good, domestic jobs, the industry may be attacked for them. Consider for example the concerns about cost of living/gentrification in Northern California.
“Sharing economy” companies such as Uber and Airbnb and others are involved in local political fights all around the world, as they undercut traditional service providers.

People who believe that technologists harm them are a major political force.

2. The technology industry is under considerable legislative, regulatory, and judicial pressure. For starters:

Tech companies are attacked for doing too little to aid law enforcement and government surveillance.
Tech companies are attacked for doing too much to aid law enforcement and government surveillance.
Tech companies are attacked for doing too little censorship.
Tech companies are attacked for doing too much censorship.
Privacy regulations are ever-changing.

Complicating things further, these challenges take different forms in different countries around the world.

Also:

China pressures foreign vendors to transfer technology into China.
Recent network neutrality developments in the US favor older telecom providers, at the expense of newer internet companies.
Anti-immigrant policies in the US threaten tech vendors.

I could keep going much longer than that. Government relations are a major, major issue for tech.

3. It is traditional to claim that advances in communication/media technologies will wreck society.

Television was going to make us mass-conformist couch potatoes.
Video games were going to make us violent couch potatoes.

This era brings similar concerns.

Social media makes us couch potatoes sitting in niche-conformist echo chambers.
Modern media over-stimulate us and wreck our attention spans.

I.e., the apocalypse is imminent, and tech is what will bring it on.

The most compelling version of that argument I’ve seen is Jean Twenge’s claims that there’s a teen mental health crisis perfectly matched in time to the rise of the smartphone. And to make any such claim seem particularly damning, please recall: Social media and gaming companies are clearly trying to foster a form of addiction in — well, in their users.

Current concerns may ebb just as previous generations’ did. But for now, they’re yet another aspect of a threat-filled environment.

4. What worries me most is this: The United States and other countries face relentless attacks on education, educators, science, scientists, and rationality itself. And there are no obvious limits to how bad these can get. China’s Cultural Revolution and the Cambodian genocide happened during my lifetime. Stalin and Hitler ruled during my parents’. All four took particular aim at people like us.

Bottom line: EVERYBODY in the technology industry should be or quickly become politically aware. We have an awful lot of politics to deal with.

Notes on vendor lock-in

Curt Monash — Wed, 20 Jul 2016 01:35:32 +0000

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.

1. The most basic form of lock-in is:

You do application development for a target set of platform technologies.
Your applications can’t run without those platforms underneath.
Hence, you’re locked into those platforms.

2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:

That simplifies your environment, requiring less integration and interoperability.
That simplifies your staffing; the same skill sets apply to multiple needs and projects.
That simplifies your vendor support relationships; there’s “one throat to choke”.
That simplifies your price negotiation.

3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:

For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
A few years later, the price is renegotiated, based on then-current levels of usage.

Thus, doing an additional project using ELAed products may appear low-cost.

Incremental license and maintenance fees may be zero in the short-term.
Incremental personnel costs may be controlled because the needed skills are already in-house.

Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.

4. Subscriptions are closely associated with lock-in.

Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
Cloud lock-in has rapidly become a big deal.
The open source vendors meeting lock-in resistance, noted below, have subscription business models.

Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.

5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.

6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:

Oppose standardization.
Like thick technology stacks.

Thus, departmental influence on IT both encourages and discourages lock-in.

7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on doesn’t really contradict that. IBM’s business model is to ~~squeeze~~ serve its still-large number of strongly loyal customers as well as it can.

8. Microsoft’s business model over the decades has also greatly depended on lock-in.

Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.

Yet sometimes, Microsoft is more free and open.

Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue per Mac to what it got for each Windows PC.)
Visual Studio is useful for writing apps to run against multiple DBMS.
Just recently, Microsoft SQL Server was ported to Linux.

9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.

10. And with that as background, we can get to what led me to finally write this post. Multiple clients have complaints that may be paraphrased as:

Customers are locked into expensive traditional DBMS such as Oracle.
Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. Redshift or other Amazon offerings), creating whole new stacks of lock-in.

So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.

I agree with them that enterprises who feel this way are getting it wrong. Indeed:

The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
Serious users need support.
Support and management tools happen to be synergistic with each other.

This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever of MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.

General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.

Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:

Facebook has more and better engineers than you do.
Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?

And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.

Related links

The technology underlying packaged applications (November, 2015, but it has a historical focus)
Topics in migration (January, 2015)
Much of the vendor advice on Strategic Messaging.

Notes from a long trip, July 19, 2016

Curt Monash — Wed, 20 Jul 2016 01:34:31 +0000

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

I’ll edit these lists as appropriate when further posts go up. Last update: August 23, 2016.

Let’s cover some other subjects right here.

1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.

Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.

And of course Flink is hoping to blow everybody else in the space away.

*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.

As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.

2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.

Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.

I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.

3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.

4. Platfora insisted on meeting under circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.

Edit: On July 22, it was announced that Workday is acquiring Platfora. Now I understand why Platfora gave me a bit of a runaround.

5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.

This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.

Machine learning’s connection to (the rest of) AI

Curt Monash — Tue, 01 Dec 2015 09:28:22 +0000

This is part of a four-post series spanning two blogs.

One post gives a general historical overview of the artificial intelligence business.
One post specifically covers the history of expert systems.
One post gives a general present-day overview of the artificial intelligence business.
One post (this one) explores the close connection between machine learning and (the rest of) AI.

1. I think the technical essence of AI is usually:

Inputs come in.
Decisions or actions come out.
More precisely — inputs come in, something intermediate is calculated, and the intermediate result is mapped to a decision or action.
The intermediate results are commonly either numerical (a scalar or perhaps a vector of scalars) or a classification/partition into finitely many possible intermediate outputs.

Of course, a lot of non-AI software can be described the same way.

To check my claim, please consider:

It fits rules engines/expert systems so simply it’s barely worth saying.
It fits any kind of natural language processing; the intermediate results might be words or phrases or concepts or whatever.
It fits machine vision beautifully.

To see why it’s true from a bottom-up standpoint, please consider the next two points.

2. It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response. Examples of what I mean include:

Think of what’s on an IQ test, or a commonly accepted substitute for same. (The SAT sometimes substitutes.) A lot of that is pattern recognition.
When the “multiple intelligences” or just “emotional intelligence” concepts gained currency, the core idea was the recognition of various different kinds of pattern. (E.g., reading somebody else’s emotions, something that I’m not nearly as good at as I am at the skills measured by standard IQ tests.)
The central mechanism of neurotransmission is a neuron recognizing that an action potential has crossed a certain threshold, and firing as a result.
Traditional areas of AI include natural language recognition, machine vision, and so on.
Another traditional area of AI is rules-based processing — conditions in, decision out.
Back in the 1980s (less so today), it was thought that a core underpinning for AI technology was knowledge representation. That said, as much as I like interesting data structures, I have my doubts.
- The Semantic Web grew out of this idea.
- Also, the single most enduring proponent of the centrality of knowledge representation was probably Doug Lenat, who gave his name to a famed unit of bogosity.
- While the previous two points are probably just coincidence, the juxtaposition is suggestive.

3. In most computational cases, pattern recognition and response boil down to scoring and/or classification (whether in a narrow machine learning sense of “classification” or otherwise). What I mean by this is:

I’m thinking of scoring as a function that maps inputs into scalar values. (Or a vector of scalars.)
I’m thinking of classification as a function that maps inputs into a finite range of possible values. (Note that this is mathematically equivalent to a finite partition on the set of inputs.)
I’m also assuming that the system maps each possible score or classification to a decision or response (deterministically or probabilistically as the case may be).
Then if you compose the two maps, you wind up with a function from {possible input patterns} to {possible responses}.
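The composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spam and message-length examples, and every function name below, are hypothetical and not taken from any particular AI system.

```python
from typing import Callable, TypeVar

X = TypeVar("X")  # input patterns
M = TypeVar("M")  # intermediate scores or classes
R = TypeVar("R")  # responses


def compose(intermediate: Callable[[X], M], respond: Callable[[M], R]) -> Callable[[X], R]:
    """Compose an input->intermediate map with an intermediate->response map."""
    return lambda x: respond(intermediate(x))


# Hypothetical scoring function: maps a message to a scalar "spamminess" score.
def spam_score(message: str) -> float:
    suspicious = ("free", "winner", "urgent")
    words = message.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in suspicious for w in words) / len(words)


# Hypothetical classification function: a finite partition of the inputs.
def length_class(message: str) -> str:
    return "short" if len(message.split()) < 5 else "long"


# Deterministic response maps, one per kind of intermediate result.
def respond_to_score(score: float) -> str:
    return "quarantine" if score >= 0.25 else "deliver"


def respond_to_class(label: str) -> str:
    return {"short": "show inline", "long": "show summary"}[label]


# Each policy is a function from {possible inputs} to {possible responses}.
spam_policy = compose(spam_score, respond_to_score)
display_policy = compose(length_class, respond_to_class)
```

In this framing, a machine learning system would replace the hand-written `spam_score` or `length_class` with a learned function, while the response map stays the same.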

4. If you want a good algorithm for classification, of course, it’s natural to pursue it via machine learning. And the same is true of scoring, at least if we recall that the domains of machine learning and statistics have essentially merged.

5. It took people remarkably long to figure out the previous point. Through at least the end of the previous century, it was generally assumed that the way to come up with clever algorithms for, for example, text analytics or machine vision was — well, to think them up.

6. As spelled out in my overview of present-day commercial AI, there’s a somewhat paradoxical industry structure, in that:

Even though machine learning is a sine qua non of many businesses, tech and non-tech alike …
… the rest of AI is largely concentrated at a few behemoth technology companies.

Of course, there are plenty of startups hoping to change that structure. I hope some of them succeed.

Teradata will support Presto

Curt Monash — Mon, 08 Jun 2015 09:32:16 +0000

At the highest level:

Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
Teradata is announcing support for Presto with a classic open source pricing model.
Presto will also become, roughly speaking, Teradata’s replacement for Hive.
Teradata’s Presto efforts are being conducted by the former Hadapt.

Now let’s make that all a little more precise.

Regarding Presto (and I got most of this from Teradata):

To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
… Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
Facebook at various points in time created both Hive and now Presto.
Facebook started the Presto project in 2012 and now has 10 engineers on it.
Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.

Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project:

Data is pipelined between operators, with no gratuitous writing to disk the way you might have in something MapReduce-based. This is different from the sense of “pipelining” in which one query might keep an intermediate result set hanging around because another query is known to need those results as well.
Presto processing is vectorized; functions don’t need to be re-invoked a tuple at a time. This is different from the sense of vectorization in which several tuples are processed at once, exploiting SIMD (Single Instruction Multiple Data). Dan thinks SIMD is useful mainly for column stores, and Presto tries to be storage-architecture-agnostic.
Presto query operators and hence query plans are dynamically compiled, down to byte code.
Although it is generally written in Java, Presto uses direct memory management rather than relying on what Java provides. Dan believes that, despite being written in Java, Presto performs as if it were written in C.
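The first two items on that checklist can be illustrated with a toy sketch. The Python below is not Presto code and makes no claim about Presto's internals; it only shows, with hypothetical operator names, what it means for operators to hand batches of rows to each other in memory, so that each operator is invoked once per batch rather than once per tuple and no intermediate result is written out.

```python
from typing import Iterable, Iterator, List

Batch = List[int]


def scan(values: Iterable[int], batch_size: int) -> Iterator[Batch]:
    """Source operator: yields rows in batches instead of one at a time."""
    batch: Batch = []
    for v in values:
        batch.append(v)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def filter_even(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Filter operator: does its work once per batch, not once per row."""
    for batch in batches:
        kept = [v for v in batch if v % 2 == 0]
        if kept:
            yield kept


def total(batches: Iterator[Batch]) -> int:
    """Aggregate operator: consumes batches as they arrive."""
    return sum(sum(batch) for batch in batches)


# Operators are chained lazily; no intermediate result set is materialized.
result = total(filter_even(scan(range(10), batch_size=4)))
```

Because the operators are generators, a batch flows through the whole chain before the next one is read, which is the pipelining idea in miniature.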

More precisely, this is a checklist for interactive-speed parallel SQL. There are some query jobs long enough that Dan thinks you need the fault-tolerance obtained from writing intermediate results to disk, à la HadoopDB (which was of course the MapReduce-based predecessor to Hadapt).

That said, Presto is a newish database technology effort, there’s lots of stuff missing from it, and there still will be lots of stuff missing from Presto years from now. Teradata has announced contribution plans to Presto for, give or take, the next year, in three phases:

Phase 1 (released immediately, and hence in particular already done):
- An installer.
- More documentation, especially around installation.
- Command-line monitoring and management.
Phase 2 (later in 2015):
- Integrations with YARN, Ambari and soon thereafter Cloudera Manager.
- Expanded SQL coverage.
Phase 3 (some time in 2016):
- An ODBC driver, which is of course essential for business intelligence tool connectivity.
- Other connectors (e.g. more targets for query federation).
- Security.
- Further SQL coverage.

Absent from any specific plans that were disclosed to me was anything about optimization or other performance hacks, and anything about workload management beyond what can be gotten from YARN. I also suspect that much SQL coverage will still be lacking after Phase 3.

Teradata’s basic business model for Presto is:

Teradata is selling subscriptions, for which the principal benefit is support.
Teradata reserves the right to make some of its Presto-enhancing code subscription-only, but has no immediate plans to do so.
Teradata being Teradata, it would love to sell you Presto-related professional services. But you’re absolutely welcome to consume Presto on the basis of license-plus-routine-support-only.

And of course Presto is usurping Hive’s role wherever that makes sense in Teradata’s data connectivity story, e.g. Teradata QueryGrid.

Finally, since I was on the phone with Justin Borgman and Dan Abadi, discussing a project that involved 16 former Hadapt engineers, I asked about Hadapt’s status. That may be summarized as:

There are currently no new Hadapt sales.
Only a few large Hadapt customers are still being supported by Teradata.
The former Hadapt folks would love Hadapt or Hadapt-like technology to be integrated with Presto, but no such plans have been finalized at this time.

More notes on HBase

Curt Monash — Tue, 17 Mar 2015 18:13:50 +0000

1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:

Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
Files have indexes and optional Bloom filters.
Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
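To make the three features above concrete, here is a minimal sketch of how they can combine on a point lookup. It is not HBase code. The `SortedFile` and `BloomFilter` classes are hypothetical, and sorting by a string key is an assumption, since I didn't confirm what the real files are sorted on.

```python
import bisect
import hashlib
from typing import List, Optional, Tuple


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 256, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str) -> List[int]:
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big")
            % self.size_bits
            for i in range(self.num_hashes)
        ]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SortedFile:
    """Hypothetical immutable, non-empty file of (key, value) pairs sorted by key."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self.rows = sorted(rows)
        self.keys = [k for k, _ in self.rows]
        self.min_key, self.max_key = self.keys[0], self.keys[-1]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key: str) -> Optional[str]:
        if not (self.min_key <= key <= self.max_key):
            return None  # file skipped via min/max metadata
        if not self.bloom.might_contain(key):
            return None  # file skipped via Bloom filter
        i = bisect.bisect_left(self.keys, key)  # binary search in sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.rows[i][1]
        return None


def lookup(files: List[SortedFile], key: str) -> Optional[str]:
    """Check each file in order and return the first match."""
    for f in files:
        value = f.get(key)
        if value is not None:
            return value
    return None
```

The point of the sketch is that most files can be ruled out from metadata alone, so only a few need an actual search.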

Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.

2. The discussion of which kinds of data are originally put into HBase was a bit confusing.

HBase is commonly used to receive machine-generated data. Everybody knows that.
Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.

OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.

3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:

Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
eBay has an HBase database largely on spinning disk, used to inform its search engine.

Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.

4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.

5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.

It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.

In general, Cloudera suggested that HBase was used in a fair number of OEM situations.

6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.

Related link

A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.

Optimism, pessimism, and fatalism — fault-tolerance, Part 2

Curt Monash — Sun, 08 Jun 2014 16:58:35 +0000

The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.

Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:

Clusters will have node hiccups more often than single nodes will. (Duh.)
Networks are relatively slow even when uncongested, and furthermore congest unpredictably.
In many applications, it’s OK to sacrifice even basic-seeming database functionality.

And so there’s been innovation in numerous cluster-related subjects, two of which are:

Distributed query and update. When a database is distributed among many nodes, how does a request access multiple nodes at once?
Fault-tolerance in long-running jobs. When a job is expected to run on many nodes for a long time, how can it deal with failures or slowdowns, other than through the distressing alternatives:
- Start over from the beginning?
- Keep (a lot of) the whole cluster’s resources tied up, waiting for things to be set right?

Distributed database consistency

When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:

Analytic RDBMS innovation.
Short-request applications designed to avoid distributed joins.
Short-request clustered RDBMS that don’t allow fully-general distributed joins in the first place.

But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as:

A write is planned and parceled out to occur on all the different nodes where the data needs to be placed.
Each node decides it’s ready to commit the write.
Each node informs the others of its readiness.
Each node actually commits.
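Those steps can be sketched roughly as follows. This is a simplified illustration, not any product's implementation: it assumes the classic variant in which a coordinator collects the readiness votes, the `Node` class is hypothetical, and an unhealthy node simply votes "no", whereas a real system would typically block while waiting for the missing message.

```python
from typing import Dict, List


class Node:
    """Hypothetical participant holding a slice of the data."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.staged: Dict[str, str] = {}
        self.committed: Dict[str, str] = {}

    def prepare(self, key: str, value: str) -> bool:
        """Phase 1: stage the write and vote on readiness."""
        if not self.healthy:
            return False
        self.staged[key] = value
        return True

    def commit(self, key: str) -> None:
        """Phase 2: make the staged write visible."""
        self.committed[key] = self.staged.pop(key)

    def abort(self, key: str) -> None:
        """Discard a staged write, if any."""
        self.staged.pop(key, None)


def two_phase_commit(nodes: List[Node], key: str, value: str) -> bool:
    # Phase 1: every node must vote "ready" before anyone commits.
    votes = [node.prepare(key, value) for node in nodes]
    if not all(votes):
        for node in nodes:
            node.abort(key)
        return False
    # Phase 2: all nodes commit.
    for node in nodes:
        node.commit(key)
    return True
```

Note that a single slow or failed participant is enough to stop the write on every node.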

Unfortunately, if any of the various messages in the 2PC process is delayed, so is the write. This creates way too much likelihood of work being blocked. And so modern approaches to distributed data writing are more … well, if I may repurpose the famous Facebook slogan, they tend to be along the lines of “Move fast and break things”,* with varying tradeoffs among consistency, other accuracy, reliability, functionality, manageability, and performance.

*By the way, Facebook recently renounced that motto, in favor of “Move fast with stable infrastructure.” Hmm …

Back in 2010, I wrote about various approaches to consistency, with the punch line being:

A conventional relational DBMS will almost always feature RYW consistency. Some NoSQL systems feature tunable consistency, in which — depending on your settings — RYW consistency may or may not be assured.

The core ideas of RYW consistency, as implemented in various NoSQL systems, are:

Let N = the number of copies of each record distributed across nodes of a parallel system.

Let W = the number of nodes that must successfully acknowledge a write for it to be successfully committed. By definition, W <= N.

Let R = the number of nodes that must send back the same value of a unit of data for it to be accepted as read by the system. By definition, R <= N.

The greater N-R and N-W are, the more node or network failures you can typically tolerate without blocking work.

As long as R + W > N, you are assured of RYW consistency.

That bolded part is the key point, and I suggest that you stop and convince yourself of it before reading further.
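
If you would rather let a computer do the convincing, the following Python sketch brute-forces the claim for small clusters. It assumes the usual reading of the argument: RYW holds when every possible set of R replicas consulted by a read shares at least one member with every possible set of W replicas that acknowledged a write. The function names are made up for illustration.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every set of r replicas shares a member with every set of w replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def check_claim(max_n=6):
    """Compare brute force with the R + W > N rule for all small configurations."""
    for n in range(1, max_n + 1):
        for r in range(1, n + 1):
            for w in range(1, n + 1):
                assert quorums_always_overlap(n, r, w) == (r + w > n)
    return True
```

With N = 3, for example, R = W = 2 always overlaps, while R = 1, W = 2 leaves room for a read to hit the one replica the write missed.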

Eventually :), Dan Abadi claimed that the key distinction is synchronous/asynchronous — is anything blocked while waiting for acknowledgements? From many people, that would simply be an argument for optimistic locking, in which all writes go through, and conflicts — of the sort that locks are designed to prevent — cause them to be rolled back after-the-fact. But Dan isn’t most people, so I’m not sure — especially since the first time I met Dan was to discuss VoltDB predecessor H-Store, which favors application designs that avoid distributed transactions in the first place.

One idea that’s recently gained popularity is a kind of semi-synchronicity. Writes are acknowledged as soon as they arrive at a remote node (that’s the synchronous part). Each node then updates local permanent storage on its own, with no further confirmation. I first heard about this in the context of replication, and generally it seems designed for replication-oriented scenarios.
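
A rough Python sketch of that semi-synchronous pattern, with hypothetical names throughout: the writer waits only for the replica to say the write has arrived, and the replica persists it later on its own schedule.

```python
class Replica:
    """Hypothetical remote node: acknowledges on arrival, persists later."""

    def __init__(self):
        self.inbox = []      # received but not yet persisted
        self.durable = {}    # stands in for local permanent storage

    def receive(self, key, value):
        self.inbox.append((key, value))
        return True          # acknowledgement: the synchronous part

    def flush(self):
        # Done by the replica on its own schedule; nobody waits for it.
        for key, value in self.inbox:
            self.durable[key] = value
        self.inbox.clear()


def replicated_write(primary, replica, key, value):
    """Write locally, then wait only for the replica's arrival acknowledgement."""
    primary[key] = value
    return replica.receive(key, value)
```

The gap between `receive` and `flush` is the tradeoff: an acknowledged write sits in the replica's memory for a while before it reaches permanent storage.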

Single-job fault-tolerance

Finally, let’s consider fault-tolerance within a single long-running job, whether that’s a big query or some other kind of analytic task. In most systems, if there’s a failure partway through a job, they just say “Oops!” and start it over again. And in non-extreme cases, that strategy is often good enough.

Still, there are a lot of extreme workloads these days, so it’s nice to absorb a partial failure without entirely starting over.

Hadoop MapReduce, which stores intermediate results anyway, finds it easy to replay just the parts of the job that went awry.
Spark, which is more flexible in execution graph and data structures alike, has a similar capability.
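
The general idea of replaying only what failed can be sketched in a few lines of Python. This is an illustration of the principle, not of Hadoop's or Spark's actual mechanisms; `run_job` and its arguments are hypothetical.

```python
def run_job(tasks, run_task, completed=None):
    """Run each task once, reusing stored results from earlier attempts.

    tasks maps task IDs to inputs; completed maps task IDs to stored results.
    Returns the updated results plus the IDs of tasks that failed this time.
    """
    completed = {} if completed is None else completed
    failed = []
    for task_id, task_input in tasks.items():
        if task_id in completed:
            continue  # intermediate result already stored; no need to replay
        try:
            completed[task_id] = run_task(task_input)
        except Exception:
            failed.append(task_id)
    return completed, failed
```

Calling `run_job` a second time with the results of the first attempt re-executes only the tasks that went awry.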

Additionally, both Hadoop and Spark support speculative execution, in which several clones of a processing step are executed at once (presumably on different nodes), to hedge against the risk that any one copy of the process runs slowly or fails outright. According to my notes, speculative execution is a major part of NuoDB’s architecture as well.
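
As a toy illustration of speculative execution, the Python sketch below runs clones of a step on threads and takes whichever finishes successfully first. Threads stand in for nodes, `run_speculatively` and `flaky_step` are hypothetical names, and no claim is made that Hadoop, Spark or NuoDB schedule their clones this way.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_speculatively(step, attempts):
    """Run one clone of `step` per entry in `attempts`; return the first success."""
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    futures = [pool.submit(step, attempt) for attempt in attempts]
    try:
        for future in as_completed(futures):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("every clone failed")
    finally:
        # Do not wait for stragglers; their results are simply ignored.
        pool.shutdown(wait=False)


def flaky_step(delay_seconds):
    """Hypothetical processing step; a negative delay simulates a failed node."""
    if delay_seconds < 0:
        raise IOError("node failed")
    time.sleep(delay_seconds)
    return f"finished after {delay_seconds}s"
```

For example, `run_speculatively(flaky_step, [0.3, 0.01, -1])` skips the clone that fails, ignores the slow one, and returns the fast clone's result.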

Further topics

I’ve rambled on for two long posts, which seems like plenty — but this survey is in no way complete. Other subjects I could have covered include but are hardly limited to:

Occasionally-connected operation, which for example is a design point of CouchDB, SQL Anywhere (sort of), and most kinds of mobile business intelligence.
Avoiding planned downtime — i.e., operating despite self-inflicted wounds.
Data cleaning and master data management, both of which exist in large part to fix errors people have made in the past.

Related links

Uninterrupted DBMS operation (September, 2012)
The cardinal rules of DBMS development (March, 2013)
Bottleneck Whack-A-Mole (August, 2009)

Hardware and storage notes

Curt Monash — Thu, 01 May 2014 02:05:16 +0000

My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:

Rack- or data-center-scale systems.
The real or imagined demise of Moore’s Law.
Flash.

I also got updated as to typical Hadoop hardware.

If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. My most interesting of several mentions of that point was when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)

One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost as one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.

Otherwise, I didn’t gain much new insight into actual flash uptake. Everybody thinks flash is or soon will be very important; but in many segments, folks are trading off disk vs. RAM without worrying much about the intermediate flash alternative.

I visited two Hadoop distribution vendors this trip, namely the ones who are my clients – Cloudera and MapR. I remembered to ask one of them, Cloudera, about typical Hadoop hardware, and got answers that sounded consistent with hardware trends Hortonworks told me about last August. The story is, more or less:

The default assumption remains $20-30K/node, 2 sockets, 12 disks. (Edit: See lively price discussion in the comments below.)
Most hardware vendors have standard/default Hadoop boxes by now, and in many cases customers just buy what’s on offer.
The aforementioned disks sometimes get up to 4 terabytes now.
128GB is now the norm for RAM. 256GB is common. Higher amounts are seen, up to – in rare cases – 2-4 TB.
Flash is of interest, but isn’t being demanded much yet. This could change when flash’s storage density matches disk’s.
Flash interest is highest for Impala.

Cloudera suggested that the larger amounts of RAM tend to be used when customers frame the need as putting certain analytic datasets entirely in RAM. This rings true to me; there’s lots of evidence that users think that way, and not just in analytic cases. This is probably one of the reasons that they often jump straight from disk to RAM without fully exploring the opportunities of flash.

One last thing — the big cloud vendors are at least considering the use of their own non-Intel chip designs, which might be part of the reason for Intel’s large Hadoop investment.

Cloudera, Impala, data warehousing and Hive

Curt Monash — Thu, 01 May 2014 02:03:51 +0000

There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.

Hive is good at some tasks and terrible at others.
- Hive is good at batch data transformation.
- Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
Impala is also meant to be good at what Hive is good at, and will someday from Cloudera’s standpoint completely supersede Hive, but Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”. Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.

And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.

Related links

A survey of SQL/Hadoop integration (February, 2014)
The cardinal rules of DBMS development (March, 2013)

DataStax/Cassandra update

Curt Monash — Sun, 08 Dec 2013 18:06:01 +0000

Cassandra’s reputation in many quarters is:

World-leading in the geo-distribution feature.
Impressively scalable.
Hard to use.

This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”

My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.

DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:

A petabyte or so of data at a very prominent company, geo-distributed, with 800+ nodes, in a kind of block storage use case.
A messaging application at a very prominent company, anticipated to grow to multiple data centers and a petabyte or so of data, across 1000s of nodes.
A 300 terabyte single-data-center telecom account (which I can’t find on DataStax’s extensive customer list).
A huge health records deal.
A Fortune 10 company.

DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.

DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion.

DataStax claims that operation is simple, that operators are “bored”, that large users appreciate the ease of operation, and so on. These claims become a lot more plausible if you recall:
- Cassandra isn’t used for databases that resemble relational schemas with 1000s of tables, lots of foreign keys, and so on.
- Performance and capacity problems in Cassandra don’t necessarily require sophisticated operational solutions; you can throw hardware at them instead.
DataStax claims that CQL (Cassandra Query Language) makes Cassandra programming and data modeling much easier than they were before. More on that below.

DataStax claims that Cassandra excels at time series use cases, where “time series” seem to equate to collections of short records with timestamps. This seems borne out by, for example, the first three use cases on my bulleted list above. Actually, it’s not just timestamps, but rather any data that is naturally ordered by a sequential field, such as packet IDs from a packet-switching network.

Finally, DataStax claims that Cassandra is good for high-velocity applications in general. A generic example that DataStax supported with some Very Big Names — whether those were of customers or prospects wasn’t entirely clear — was in retailing, to actually serve accurate information as to whether inventory is in stock, something Walmart failed at as recently as last year.

Now let’s talk a bit about Cassandra technology. I’ll start with an example. Imagine a “phone-home” use case in which many devices emit many records each in the form of (DeviceID, TimeStamp, MeterReading) triples.

A relational database would store that as a bunch of rows, 3 columns wide.
A Cassandra database, however, would have a single row for each DeviceID; each row would contain two columns for each (TimeStamp, MeterReading) pair.
The column names are composite, in a way that shows the different column pairs are each recording the same kind of thing.
Cassandra Query Language (CQL) lets you query (or insert) as if the data were in the relational-table logical format. But of course you can also reference Cassandra in a way that takes its actual (row, column) structure at face value.
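
The two views of the same data can be mimicked in plain Python. This is a deliberately simplified model, not Cassandra's storage format or the CQL API: each DeviceID gets one wide row, and each reading becomes a cell whose composite column name is assumed here to be a (TimeStamp, field name) pair. The `insert` and `select_all` helpers are hypothetical.

```python
from collections import defaultdict

# Physical view (simplified): one wide row per DeviceID, whose columns carry
# composite names of the form (TimeStamp, field_name).
wide_rows = defaultdict(dict)


def insert(device_id, timestamp, meter_reading):
    """Insert as if into a 3-column table; store as a cell in a wide row."""
    wide_rows[device_id][(timestamp, "MeterReading")] = meter_reading


def select_all():
    """Logical view: (DeviceID, TimeStamp, MeterReading) triples, table-style."""
    return [
        (device_id, timestamp, value)
        for device_id, columns in sorted(wide_rows.items())
        for (timestamp, _field), value in sorted(columns.items())
    ]
```

Inserting three readings for two devices yields two wide rows physically, while `select_all()` still hands back three table-style triples, ordered by timestamp within each device.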

So in essence, you have schemas that at once are dynamic and tabular. The big downside vs. a relational DBMS is that — duh! — you can’t have the benefits of normalization.

For clarity, I should note that much of Cassandra’s logical architecture is shared by fellow BigTable-architecture data store HBase; it’s not a coincidence that Facebook invented Cassandra to support messaging, nor that when Facebook changed its mind about that, it adopted HBase as the alternative. Accumulo has similar characteristics as well.

Physically, what’s going on in Cassandra is something like this:

Each Cassandra row is maintained in memory, and in most cases sorted on timestamp (or some other comparator), in either order. This is the basis for the claims of great Cassandra performance and general suitability specifically in time series use cases. (E.g., “Last 10 events” kinds of reads are very easy.)
Once rows are flushed to disk, they are immutable … except that of course they eventually are compacted, typically via a merge sort. (When you do need to do a database update, last write wins.)
Rows are organized into files on disk. There’s a “key cache” that in many cases will tell you exactly which file contains the row you’re looking for. If you have a cache miss …
… each file has a Bloom filter predicting which keys it contains, and you interrogate those. Those Bloom filters are also maintained in memory (and copied on disk just for the sake of persistence).
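
The read path in the last two points can be sketched in Python as follows. Everything here is illustrative: `BloomFilter`, `DataFile` and `read_row` are hypothetical names, the filter sizes are arbitrary, and the sketch assumes files are checked newest first so that the most recent write wins. Cassandra's real implementation differs in many details.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DataFile:
    """Hypothetical immutable on-disk file of rows, with its Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read_row(key, files, key_cache):
    """Look up a row; `files` is assumed to be ordered newest first."""
    cached = key_cache.get(key)
    if cached is not None:
        return cached.rows[key]          # key cache hit: go straight to the file
    for data_file in files:
        # Only open files whose Bloom filter says the key might be there.
        if data_file.bloom.might_contain(key) and key in data_file.rows:
            key_cache[key] = data_file
            return data_file.rows[key]
    return None
```

A Bloom filter can wrongly say "maybe" but never wrongly says "no", which is why the sketch still checks the file itself before returning a row.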

Cassandra has few indexes, and no physical concept of datatype.

The benefits I see to this physical architecture are mainly:

Plays nicely with Cassandra’s logical architecture.
Plays nicely with scale-out.
Seems to have been designed RAM-first, which matches how databases are actually used.
Is fast for range queries on the comparator (e.g. timestamp).
Doesn’t have a lot of knobs to twiddle, which makes it plausible that a relatively immature product can be easy to administer.

For some use cases, that’s not a bad list of advantages. Not bad at all.

Related link

I covered some real basics in a Cassandra technical overview 3 1/2 years ago.
WibiData Kiji’s most fundamental goal — there are others too — is to tame HBase data modeling much as CQL tames Cassandra’s.