February 9, 2011

Clarification on dbShards’ shard replication

After I posted recently about dbShards, a Very Smart Commenter emailed me with the challenge “but each individual shard is still replicated via two-phase commit, and everybody knows two-phase commit is fundamentally slow.” I replied that no, it wasn’t exactly two-phase commit, but fumbled the explanation of why — so I decided to escalate straight to dbShards honcho Cory Isaacson. Cory’s clarification, lightly edited as per his permission, was:

We are far faster than a 2PC, because we found that if we make the transaction log reliable (our own), then we can go much faster than the actual database engine. We call this approach “out of band” replication, because we replicate the transaction log, and then asynch writes to the secondary database (usually only milliseconds behind). This way a single shard with two servers can handle 1000s of writes/second; again, we can go much faster than the typical database. This link shows a diagram of how it works.

Typically our overhead is around 10%, meaning that we perform with full reliable replication at 90% the speed of an unprotected database. In some cases we are actually faster than a standalone database, because we are very efficient at autoincrement and other internals.

The key paragraph in Cory’s link is probably:

As each SQL statement is executed in the application, the dbShards driver asynchronously sends the SQL to the primary replication agent in parallel to executing the SQL against the primary database. The primary replication agent asynchronously sends the SQL to the secondary replication agent. When the application issues a commit statement, the driver performs a synchronous communication with the primary replication agent to ensure that the transaction details have been received by both the primary and secondary agent. Once the driver receives an ACK then the transaction is committed to the database.

As I read that, it is indeed a kind of two-phase commit protocol. However, the remote phase isn’t being handled by the remote DBMS, but rather by dbShards, and it doesn’t involve checking whether the second copy of the transaction will succeed. So Cory’s claims of performance very different from those of regular 2PC are not ridiculous.


13 Responses to “Clarification on dbShards’ shard replication”

  1. dbShards update | DBMS 2 : DataBase Management System Services on February 9th, 2011 4:37 am

    [...] In essence, two-phase database commit is replaced by two-phase log synchronization. Once it’s known that the transaction logs on two machines agree, the underlying DBMS is responsible for getting each of the databases updated. Edit: A subsequent post clarifies dbShards’ shard replication. [...]

  2. Mark Callaghan on February 9th, 2011 10:39 am

    What happens in this case when a client requests commit?
    1) client sends request to primary replication agent
    2) I assume the primary replication agent does something to get an ACK from secondary replication agent
    3) Oops, client connection to primary database is closed (network blip, connection timeout, whatever) and transaction is rolled back

    At this point the transaction was committed to the replica but not the primary. Is this possible? How is recovery done?

  3. Cory Isaacson on February 9th, 2011 1:22 pm


    Thanks for the post. You are correct in your assumptions, normal 2PC is very slow, because it relies on both databases managing the transaction. Our approach is very fast as you describe, plus its scalable as well (as we add shards, we add primary/secondary servers, so everything is linear for write speeds).

    I will also say that of all the challenges we had with dbShards, this was the toughest to get right. But we did indeed make it work, its well-proven in serious production applications now with our customer base.


  4. Cory Isaacson on February 9th, 2011 1:25 pm


    Very good question.

    Here is a more detailed explanation of the process:

    1) The client sends the request to both the primary agent and the primary database in parallel
    2) Everything is streamed by the primary agent to the secondary agent
    3) The client sends a pre-commit to the primary agent, which goes to the secondary and returns an ACK (the only time we need a synchronous ACK)
    4) Send a COMMIT to the primary database
    4) If the primary database fails (at any point, not just the commit), we send a ROLLBACK to the primary agent, and the transaction is ignored by the secondary agent/database
    5) If the primary database succeeds, we send a COMMIT to the primary agent, and the transaction is written to the secondary database

    The reasons we can do this very efficiently are:

    a) The pre-commit is a very small message, and is used to ensure that all SQL statements made it to the secondary. Plus this is the only sequence that requires an ACK, everything else is asynch and TCP optimized.
    b) The actual COMMIT or ROLLBACK to the agent is done in a lazy fashion as we know we have the entire transaction secure on the primary and secondary agent. Therefore, no additional delays are added at this point, the application has already moved on to its next action.

    Typically the lag to a write to the secondary database is milliseconds, so it works very well in practice.

    Let me know if that clarifies it.


  5. Mark Callaghan on February 9th, 2011 2:27 pm

    What happens to a transaction when it was ACK’d by the primary agent, committed on the primary but never committed to the the primary agent? Is there a journal table on the primary to record transaction outcomes?

  6. Cory Isaacson on February 9th, 2011 6:00 pm


    We do maintain our own transaction log, so that journals all transactions. I’m not sure I fully understand your question, but basically if it commits to the primary database, then the commit is issued to the primary agent and everything flows through.


  7. Mark Callaghan on February 9th, 2011 9:01 pm

    Whatever does the commit to the primary database can crash before doing a commit to primary agent. How is recovery done in that case?

  8. Cory Isaacson on February 10th, 2011 10:59 am

    Similar to other 2PC operations, we mark the transaction suspect. There are several options for recovery at that point, including automatic verification of the transaction in the primary database, or administrator decision for an ambiguous UPDATE statement. dbShards logs everything that occurs so that it can be corrected.

  9. Daniel Abadi on February 10th, 2011 3:10 pm

    This technique is called “log-shipping”. It has been known to work well by the database community for many years and has been integrated into several database replication products. In my opinion, it is safe to assume that dbShards has implemented this correctly.

    It works great for single-partition transactions (like what dbShards does). When you get to multi-partition transactions, the story gets more complicated.

  10. Cory Isaacson on February 10th, 2011 10:50 pm


    Thanks for the feedback. dbShards actually logs each transaction across two servers (in-memory before a commit, then to disk).

    I found this description of Log Shipping (MS SQL and Postgres both support this), and it’s different in that with this approach you ship entire log files between the servers, not individual transactions. I do know with Postgres you can ship a log of 1 transaction, but that is much slower than dbShards replication.


    With dbShards we do use something similar to log shipping for a recovery operation, but not for routine operation.


  11. Daniel Abadi on February 11th, 2011 10:11 am


    You say:

    “We are far faster than a 2PC, because we found that if we make the transaction log reliable (our own), then we can go much faster than the actual database engine. We call this approach “out of band” replication, because we replicate the transaction log, and then asynch writes to the secondary database (usually only milliseconds behind).”

    From this it clear that you are saying that you are replicating the transaction log. By “replicating the transaction log”, either you mean that you are replicating the list of individual transactions input to the system (which means the replicas will diverge if you don’t run each replica in single-threaded mode (which is very slow) or you mean you are replicating the log of actions that the database took to apply a transaction, in which case you are doing log shipping, which is how database systems have dealt with replication for decades.

  12. Aaron on February 11th, 2011 1:53 pm

    This seems to me to miss the point entirely. The classic primary db is fast because the DBMS allows parallel txn activity and consolidates multiple threads to a common log, where sequential writes are fast and aggregate many txns.

    2PC is not slow, but adds handshake latency at the txn level – and there are typically a lot of these adding aggregate delays, especially to small txns.

    This model seems to optimize by removing ordering guarantees, removing the sequential apply bottleneck at the replicant. This is a popular NoSQL tradeoff.

    Is the replicant supposed to be ACID consistant with the source?

  13. Are there any remaining reasons to put new OLTP applications on disk? | DBMS 2 : DataBase Management System Services on September 19th, 2011 1:38 pm

    [...] Local high availability. [...]

Leave a Reply

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.